Precise longitudinal monitoring of memory operations

ABSTRACT

A processor includes a memory subsystem having a first memory subunit that includes a status register and an execution engine unit coupled to the memory subsystem. The execution engine unit is to: randomly select a load operation to monitor; determine a re-order buffer identifier of the load operation; and transmit the re-order buffer identifier to the memory subsystem. Responsive to receipt of the re-order buffer identifier, the first memory subunit stores a piece of information, related to a status of the load operation, in the status register. Responsive to detection of retirement of the load operation, the first memory subunit is to store the piece of information from the status register into a particular field of a record of a memory buffer, wherein the particular field is associated with the first memory subunit.

TECHNICAL FIELD

Embodiments of the disclosure relate generally to performance monitoring, and more specifically, but without limitation, to precise longitudinal monitoring of memory operations.

BACKGROUND

Performance analysis is the foundation for characterizing, debugging, and tuning a microarchitectural design, finding and fixing performance bottlenecks in hardware and software, as well as locating avoidable performance issues. As the computer industry progresses, the ability to analyze the performance of a microarchitecture and make changes to the microarchitecture based on that analysis becomes more complex and important.

BRIEF DESCRIPTION OF THE DRAWINGS

The disclosure will be understood more fully from the detailed description given below and from the accompanying drawings of various embodiments of the disclosure. The drawings, however, should not be taken to limit the disclosure to the specific embodiments, but are for explanation and understanding only.

FIG. 1 is a block diagram of system microarchitecture according to various embodiments.

FIG. 2 is a graph that illustrates a problem of locating load blocks and determining penalties.

FIG. 3A is a block diagram illustrating microarchitecture for a processor according to an embodiment.

FIG. 3B is a block diagram illustrating an in-order pipeline and a register renaming stage, out-of-order issue/execution pipeline according to an embodiment.

FIG. 4 is a block diagram illustrating system microarchitecture and functionality of longitudinal profiling of a load operation selected for monitoring according to an embodiment.

FIG. 5 is a flow chart of a method for precise longitudinal monitoring of memory load operations according to various embodiments.

FIG. 6 is a block diagram illustrating microarchitecture for a processor in accordance with one embodiment.

FIG. 7 is a block diagram illustrating a system in which an embodiment of the disclosure may be used.

FIG. 8 is a block diagram illustrating a system in which an embodiment of the disclosure may operate.

FIG. 9 is a block diagram illustrating a system in which an embodiment of the disclosure may operate.

FIG. 10 is a block diagram illustrating a System-on-a-Chip (SoC) according to an embodiment.

FIG. 11 is a block diagram illustrating a SoC design according to an embodiment.

FIG. 12 is a block diagram illustrating a computer system according to an embodiment.

DETAILED DESCRIPTION

The embodiments described herein are directed to performance monitoring (PerfMon), also referred to as profiling, of micro-architectural design to characterize, debug, and tune the design, find and fix performance bottlenecks in hardware and software, as well as locate avoidable performance issues. Performance monitoring generally seeks to count some event (e.g., cache access) and tag that information to particular instructions to which software can relate, e.g., particular memory operations performed in association with these instructions.

In one embodiment, Precise Event Based Sampling (PEBS) is a feature available to a subset of events that allows the hardware to collect additional information very close to the exact time the configured event overflowed. Monitoring based on PEBS takes an arbitrary point in time and tries to see what instructions are being executed at that time that may be contributing to a particular event. This is known as vertical profiling. The present disclosure relates to horizontal profiling, e.g., randomly selecting an instruction from an instruction stream, and saving information about what that instruction does and how other related or dependent instructions may impact the instruction as time passes.

The disclosed microarchitecture and methods may provide performance monitoring information including stall reasons and penalties of memory accesses, with respect to loads in particular, in high-performance out-of-order (OOO) cores. Such information can map offending memory accesses to the precise instructions triggering the offending memory accesses. This information can aid performance engineers to optimize and tune performance of demanding workloads on multi-core platforms. Optimization efforts have increased as more speedup is derived from microarchitecture design and software tuning.

For example, there are a number of challenges in performance monitoring technology with identifying and fixing loads that are blocked from being forwarded from an earlier store. It is difficult to identify the exact load that is blocked because the load block events tend to skid a few cycles away from the problematic load, as illustrated in FIG. 2. For example, in a skid of eight cycles there can be up to 40 load operations in a five-wide processor core. FIG. 2 illustrates that the load which is blocked occurs three to eight instructions before the event that identifies (e.g., tags) the store forward blocked due to skid. Skid in PEBS refers to delay in time between stopping the processor and recording the state information from the processor. There may be many other load operations within the three-to-eight-instruction window that may incur the store forward blocked penalty. It is therefore difficult to identify the store operation involved in the load blocked case.
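The size of the skid window in the example above follows from simple arithmetic. The sketch below is illustrative only; the function name skid_candidates is hypothetical, and it assumes the worst case in which every slot of every skidded cycle holds a load operation.

```python
def skid_candidates(skid_cycles: int, issue_width: int) -> int:
    """Upper bound on operations that can fall inside a skid window.

    Assumes (worst case) that every slot in every skidded cycle is occupied.
    """
    if skid_cycles < 0 or issue_width < 1:
        raise ValueError("skid_cycles must be >= 0 and issue_width >= 1")
    return skid_cycles * issue_width


# Eight cycles of skid on a five-wide core, as in the example above.
print(skid_candidates(8, 5))  # 40
```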

In order to fix the load block case, it may be necessary to understand not only the location of the load that is blocked but also the store that is responsible for the load being blocked. It is difficult to determine the cost of load block cases, such as those due to a store forward, as the penalty may be partially reliant on the latency of the store. For example, a store that misses in the last level cache can push out the blocked load until the cache line containing the store is fetched from memory. There is a high hardware cost in an implementation that aims to make all of the various load block cases precise using traditional precise mechanisms such as PEBS. Furthermore, non-precise performance monitoring and static software analysis are also insufficient.

More specifically, non-precise events may lead to error in detecting the correct load operation due to skid as well as speculative accesses. The event may not be tagged to the right instructions in such cases. Finding store forward blocks statically in code identifies only a subset of the cases that are impacting performance. Software-based solutions may also fail whenever a microarchitecture condition is involved in blocking load operations, e.g., if some store address was not yet resolved by the time the load operation is executed. Turning the current non-precise load block events into precise events would have an added disadvantage in that doing so may fail to account for the latency or cost of the load block.

There are also hardware implementation costs for making additional PEBS events. For example, various block and latency states would have to be tracked in critical structures, like in the re-order buffer, for micro-operations (μops).

Simultaneous collection of several events may present additional challenges. For example, much information may be needed to properly analyze a performance issue, such as a load address, a data source (e.g., L2 hit, L3 hit, and the like), a block condition (store forward, unknown data, unknown address) and other information (such as existence of a lock, split, translation lookaside buffer information, and the like). Due to shortages in the number of general counters, event-based solutions (precise or not) require software tools to perform multiple runs of the workload or counter multiplexing, compromising the fidelity of the profiling data.

To resolve these challenges, the disclosed microarchitecture may employ processor units that record various data for a monitored transaction, e.g., precise longitudinal monitoring of memory loads. A single transaction may be randomly selected for monitoring. An out-of-order unit may track instruction latency in cycles, an address generation unit may record a data load address (DLA) into a register, a data cache unit (DCU) may track cache latency, a memory ordering buffer (MOB) may track block conditions, and other memory subunits may track additional information. A profiling tool (e.g., a software tool) may sample and attribute the information to particular instructions. For example, a store-unknown-address-block bit may inform whether a load operation at a given instruction pointer (e.g., EventingIP or Instruction Pointer in PEBS) cannot forward from an earlier store operation. The DLA of the load operation may help to identify the particular offending store operation. The difference of instruction latency and the cache latency may determine the penalty of that block. This allows performance-critical contentions to be identified and fixed in software (e.g., by prioritizing the load operation above the store operation). Additional sources of information and data will be discussed.

In one embodiment, a processor may include a memory subsystem with multiple memory subunits, each of which includes a status register. An execution engine unit may be coupled to the memory subsystem and be adapted to: randomly select a load operation to monitor; determine a re-order buffer identifier of the load operation; and transmit the re-order buffer identifier to the memory subsystem. Responsive to receipt of the re-order buffer identifier, each memory subunit of the memory subsystem may store a piece of information, related to a status of the load operation, in its status register. Such pieces of information were just discussed by way of example. The processor may further, in response to detection of the retirement of the load operation, retrieve the pieces of information from the various status registers and store each piece of information from corresponding status registers into a particular field of a record of a memory buffer. In one embodiment, the processor also checks that the load operation has undergone a threshold latency in execution before retrieving and storing the pieces of information into the fields of the record of the memory buffer. The particular field may be associated with a corresponding memory subunit or a particular type of information obtained from the corresponding memory subunit. Additional information may also be written into the record such as a counter value for an instruction latency counter and a data access address of the load operation. See Table 1 for a more complete list of information that may be stored in a record of the memory buffer.

In various embodiments, the disclosed microarchitecture and methods for longitudinal monitoring of memory operations may provide the exact instruction pointer of the load operation that is blocked, without skid, due to the precise nature of a load latency event. A priori random selection of the load operation may help to avoid skid (or bias) from counter overflow until retirement is stopped or if multiple μops retire in a given cycle.

Furthermore, the disclosed microarchitecture and methods may collect many different kinds of information for one transaction. In PEBS, a programmer may choose the event to focus on and collect just this information for many instructions. If a first load operation experienced an event, results of the PEBS monitoring may not be able to tell if another load operation experienced that same event, or determine the latency of the first load operation.

The disclosed microarchitecture and methods may also identify a particular address for the instruction, so the microarchitecture may determine what other instruction last wrote to this address. With this information, the microarchitecture may then see dependencies between instructions. For example, the microarchitecture may include means to determine the store operation which generated the load block situation through investigating the DLA as well as addresses of the registers that are included in the record of the memory buffer.

The latencies of the load may be determined by being integrated into a load-latency facility, where instruction latency measures overall time including instruction dependencies and memory ordering checks. In one embodiment, the load-latency facility is hardware that records latency of a load and that may, as a result, estimate from where the load is arriving, e.g., a particular level of cache or from memory, or the like, associated with a particular latency. Cache latency may measure only the memory subsystem time for serving that request. The difference between the instruction (e.g., load) latency and the cache latency may be useful in determining memory operation blocks for which no cache misses are involved.

Furthermore, the disclosed microarchitecture and methods may incur lower hardware implementation costs by employment of longitudinal (as opposed to event-based) profiling. The microarchitecture and methods may enable tracking a single transaction at a time with distributed recording of monitoring information. Hence, there is no requirement for a per-entry state in critical processor structures, for at-retirement tagging, and/or for expensive mechanisms to avoid skid.

Additionally, the disclosed microarchitecture and methods may enjoy atomicity and fidelity of the profiling data as the information is coherent and relates to a particular transaction as an instruction is executed. In contrast, event-based sampling may collect information across different runs or stitch information from different transactions due to counter multiplexing.

FIG. 1 is a block diagram of system microarchitecture 100 that is capable of precise longitudinal monitoring of memory operations according to various embodiments. In an embodiment, the system microarchitecture 100 is a processor, a system-on-a-chip (SoC), or other processing device, which may be implemented on a single die (a same substrate) and within a single semiconductor package. The system microarchitecture 100 may be instantiated as a central processing unit (CPU), a graphics processing unit (GPU), or the like.

Referring to FIG. 1, the system microarchitecture 100 may include multiple cores, of which a processor core 102 is represented by way of explanation, and memory 110. The processor core 102 may contribute to out-of-order (OOO) processing clusters of the system microarchitecture 100. In various embodiments, the processor core 102 includes a memory buffer 114, which includes multiple performance monitoring records 116 (e.g., memory buffer records containing performance monitoring data), and optional microcode 120 executable by the processor core 102 (or other logic) to populate the memory buffer 114 and interface with software. The memory buffer 114 may be computer storage expected to be present on the processor core 102, whether volatile or non-volatile, persistent or non-persistent, random access memory (RAM), flash memory, or the like. In an alternative embodiment, at least a part of the memory buffer 114 is stored in the off-chip memory 110. The processor core 102 may further include a front end unit 130 for branch prediction, instruction cache, instruction fetch, and that may include a decode unit to decode fetched instructions, as will be discussed in more detail with reference to FIG. 3A.

With continued reference to FIG. 1, the system microarchitecture 100 may further include an execution engine unit 150 and a memory subsystem 170. The execution engine unit 150 may include a number of components that will be discussed in more detail with reference to FIG. 3A, and may include a unified scheduler 151, also known as a reservation station (RS), and a retirement unit, also known as a reorder buffer (ROB) 154. The unified scheduler 151 may be a decentralized feature of the microarchitecture of a CPU that allows for register renaming, and may be used by the Tomasulo algorithm for dynamic instruction scheduling. The ROB 154 may reorder instructions that retire into program order, so that despite some instructions being executed out of order, data that results from their execution is reordered properly. Additional or different execution subunits may also make up the execution engine unit 150.
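The in-order retirement behavior of a re-order buffer can be illustrated with a small software model. This is a minimal sketch under stated assumptions, not the disclosed hardware: the class ReorderBuffer and its methods allocate, complete, and retire are hypothetical names, and the model ignores capacity limits, exceptions, and speculation.

```python
from collections import deque


class ReorderBuffer:
    """Toy model: operations complete in any order but retire in program order."""

    def __init__(self):
        self._entries = deque()  # oldest (earliest in program order) first
        self._next_id = 0

    def allocate(self, name):
        """Allocate an entry in program order and return its ROB identifier."""
        rob_id = self._next_id
        self._next_id += 1
        self._entries.append({"id": rob_id, "name": name, "done": False})
        return rob_id

    def complete(self, rob_id):
        """Mark an entry as executed; this may happen out of order."""
        for entry in self._entries:
            if entry["id"] == rob_id:
                entry["done"] = True
                return
        raise KeyError(rob_id)

    def retire(self):
        """Retire completed entries from the head only, preserving program order."""
        retired = []
        while self._entries and self._entries[0]["done"]:
            retired.append(self._entries.popleft()["name"])
        return retired


rob = ReorderBuffer()
load_id = rob.allocate("load")
add_id = rob.allocate("add")
store_id = rob.allocate("store")

rob.complete(store_id)        # youngest operation finishes first
first = rob.retire()          # nothing retires: the load at the head is not done
rob.complete(load_id)
second = rob.retire()         # only the load retires
rob.complete(add_id)
third = rob.retire()          # the add and the already-completed store retire
print(first, second, third)
```

In this model the ROB identifier returned by allocate plays the role of the tag that the execution engine unit later transmits to the memory subsystem to name the monitored load.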

In various embodiments, the memory subsystem 170 includes multiple memory subunits 172, including a first memory subunit 172A, a second memory subunit 172B, a third memory subunit 172C, and so forth until an Nth memory subunit 172N. Each memory subunit may include a status register (e.g., a temporal register, a scratch control register (SCR), or the like), respectively a first status register 174A, a second status register 174B, a third status register 174C, and so forth until an Nth status register 174N. What these memory subunits may represent will be discussed in more detail with reference to FIGS. 2, 3A-3B, and 4. One will appreciate that there may be more or fewer than the number of memory subunits depicted in FIG. 1, as these are illustrated merely by way of example and for purpose of explanation.

In one embodiment, the execution engine unit 150 may be coupled to the memory subsystem 170 and be adapted to: randomly select a load operation to monitor; determine a re-order buffer identifier of the load operation; and transmit the re-order buffer identifier to the memory subsystem 170. Responsive to receipt of the re-order buffer identifier, each memory subunit 172A, 172B, 172C, . . . 172N may store a piece of information, related to a status of the load operation, in its status register 174A, 174B, 174C, . . . 174N, respectively. The processor core 102 may further execute the microcode 120 (or other logic) to: detect retirement of the load operation; and, in response to detection of the retirement of the load operation, store each piece of information from corresponding status registers 174A, 174B, 174C, . . . 174N into a particular field of a record of the memory buffer 114, e.g., one of the performance monitoring records 116. In an alternative embodiment no microcode is executed and so the respective memory subunits may detect retirement of the load operation and directly store each piece of information from corresponding status registers into the particular field of the record of the memory buffer 114.
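The distributed recording flow just described (transmit the re-order buffer identifier, latch status per subunit, gather on retirement) can be sketched as a software model. The sketch is an illustration under assumptions: the class MemorySubunit, the methods arm and observe, the function collect_on_retirement, and the field names are all hypothetical and do not come from the disclosure.

```python
class MemorySubunit:
    """Toy memory subunit holding one status register for one monitored load."""

    def __init__(self, field_name):
        self.field_name = field_name      # record field this subunit fills
        self.monitored_rob_id = None      # ROB identifier being tracked
        self.status_register = None       # latched piece of information

    def arm(self, rob_id):
        """Receive the ROB identifier of the load selected for monitoring."""
        self.monitored_rob_id = rob_id
        self.status_register = None

    def observe(self, rob_id, info):
        """Latch status only when the event belongs to the monitored load."""
        if rob_id == self.monitored_rob_id:
            self.status_register = info


def collect_on_retirement(subunits, retired_rob_id):
    """On retirement of the monitored load, copy each status register
    into the record field associated with its subunit."""
    record = {}
    for unit in subunits:
        if unit.monitored_rob_id == retired_rob_id:
            record[unit.field_name] = unit.status_register
    return record


mob = MemorySubunit("block_condition")
dtlb = MemorySubunit("dtlb_status")
dcu = MemorySubunit("cache_latency")
subunits = [mob, dtlb, dcu]

for unit in subunits:
    unit.arm(17)                          # monitored load has ROB identifier 17

mob.observe(17, "STORE_FWD_BLK")
dtlb.observe(17, "DTLB-miss")
dcu.observe(9, 300)                       # different load: ignored
dcu.observe(17, 12)

record = collect_on_retirement(subunits, 17)
print(record)
```

Because only one transaction is tracked at a time, each subunit needs a single status register rather than per-entry state, which is the cost argument made earlier in the description.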

In embodiments, the particular field of the performance monitoring record 116 may be associated with a corresponding memory subunit or a particular type of information obtained from the corresponding memory subunit, as will be discussed in more detail with reference to FIGS. 4-5. Additional information may also be written into the performance monitoring record 116 such as a counter value for an instruction latency counter and a data access address of the load operation. Table 1 contains a more complete list of information that may be stored in a performance monitoring record 116 of the memory buffer 114, although additional or different information or data may be stored in alternative embodiments.

TABLE 1
Example Performance Monitoring Record

Offset  Group name  Field name            Bits     Details
0x0     Basic       Record Format         [47:0]
                    Record Size           [63:48]
0x8                 Instruction Pointer            EventingIP
0x10                TSC
0x18                Applicable Counters
0x20    Memory      Access Address (DLA)           DLA
0x28                Auxiliary Info (AUX)  [3:0]    DATA_SRC
                                          [5][4]   Lock, DTLB-miss
                                          [6]      STORE_FWD_BLK
                                          [7]      STORE_ADDR_BLK
0x30                Access Latency        [15:0]   Instruction Latency
0x34                                      [47:32]  Cache Latency
0x38                TSX Info                       TSX Information

The information or data stored within the exemplary record of Table 1, which may be stored in the memory buffer 114, includes basic data and memory-related data. The basic data may include record format and record size, the instruction pointer of the instruction (e.g., memory operation), a time stamp counter (TSC) value, and additional applicable counters. The memory-related fields may include an access address (e.g., a data load address, or DLA), auxiliary information, access latency, and other transaction information related to Transaction Synchronization Extensions (TSX) architecture. The auxiliary information may include data stored in a scratch control register (SCR), lock data or indications of data translation lookaside buffer (DTLB) misses/hits, whether there has been a store forward block of a load operation (STORE_FWD_BLK), and whether there has been an unknown store address block of a load operation (STORE_ADDR_BLK), both of which will be discussed in more detail. For example, the piece of information may be whether the load operation is blocked due to an address collision with an earlier store operation. The access latency information, stored in access latency fields, may include a value for instruction latency (e.g., an instruction latency value) and a value for cache latency (e.g., a cache latency value). The difference between the instruction latency and the cache latency may account for delay due to a block event.
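A profiling tool could unpack a record laid out as in Table 1 along the following lines. This is a hedged sketch, not a definitive decoder: it assumes the record is eight little-endian 64-bit words at offsets 0x0 through 0x38, that the [5][4] entry means bit 5 is Lock and bit 4 is DTLB-miss (read in the order listed), and that the cache latency occupies bits [47:32] of the word at offset 0x30. The function name decode_record and the dictionary keys are hypothetical.

```python
import struct


def decode_record(raw: bytes) -> dict:
    """Decode one 64-byte record following the Table 1 layout.

    Byte order (little-endian) and the Lock/DTLB-miss bit order are assumptions.
    """
    if len(raw) < 64:
        raise ValueError("record must be at least 64 bytes")
    basic, ip, tsc, counters, dla, aux, latency, tsx = struct.unpack("<8Q", raw[:64])
    return {
        "record_format": basic & ((1 << 48) - 1),        # 0x0, bits [47:0]
        "record_size": (basic >> 48) & 0xFFFF,           # 0x0, bits [63:48]
        "eventing_ip": ip,                               # 0x8
        "tsc": tsc,                                      # 0x10
        "applicable_counters": counters,                 # 0x18
        "dla": dla,                                      # 0x20
        "data_src": aux & 0xF,                           # 0x28, bits [3:0]
        "dtlb_miss": bool((aux >> 4) & 1),               # assumed bit [4]
        "lock": bool((aux >> 5) & 1),                    # assumed bit [5]
        "store_fwd_blk": bool((aux >> 6) & 1),           # bit [6]
        "store_addr_blk": bool((aux >> 7) & 1),          # bit [7]
        "instruction_latency": latency & 0xFFFF,         # 0x30, bits [15:0]
        "cache_latency": (latency >> 32) & 0xFFFF,       # bits [47:32]
        "tsx_info": tsx,                                 # 0x38
    }


# Made-up sample values, for illustration only.
example = struct.pack(
    "<8Q",
    (64 << 48) | 1,        # record size 64, record format 1
    0x401000,              # EventingIP
    123456,                # TSC
    0x1,                   # applicable counters
    0x7FFE0010,            # DLA
    0x43,                  # DATA_SRC = 3, STORE_FWD_BLK set
    (12 << 32) | 45,       # cache latency 12, instruction latency 45
    0,                     # TSX information
)
rec = decode_record(example)
print(rec["store_fwd_blk"], rec["instruction_latency"] - rec["cache_latency"])
```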

Accordingly, engineers may inspect latencies for different instructions, and make comparisons between these latencies. If engineers see higher-than-expected latencies, and correlate the block (BLK) bits of the auxiliary information in the record of Table 1, then they may determine the kind of block or instructions that may be causing the latency. Certain types of block events may explain corresponding cases of higher latencies, as discussed herein.
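The kind of inspection described above can be sketched as a small triage routine: compute the block penalty as the difference between instruction latency and cache latency, and report which BLK bits are set when the penalty is high. The function names block_penalty and triage, the dictionary keys, the threshold, and the clamping of negative differences to zero are all assumptions made for illustration.

```python
def block_penalty(instruction_latency: int, cache_latency: int) -> int:
    """Cycles not explained by the memory subsystem (clamped at zero, an assumption)."""
    return max(instruction_latency - cache_latency, 0)


def triage(record: dict, threshold: int):
    """Return a summary when the block penalty reaches the threshold, else None."""
    penalty = block_penalty(record["instruction_latency"], record["cache_latency"])
    if penalty < threshold:
        return None
    reasons = [
        name for name in ("store_fwd_blk", "store_addr_blk") if record.get(name)
    ]
    return {
        "eventing_ip": record["eventing_ip"],
        "dla": record["dla"],
        "penalty": penalty,
        "reasons": reasons or ["unclassified"],
    }


# Made-up sample record, for illustration only.
sample = {
    "eventing_ip": 0x401000,
    "dla": 0x7FFE0010,
    "instruction_latency": 45,
    "cache_latency": 12,
    "store_fwd_blk": True,
    "store_addr_blk": False,
}
report = triage(sample, threshold=20)
print(report)
```

The reported DLA is what a tool would then use to search for the earlier store operation that wrote to the same address.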

FIG. 3A is a block diagram illustrating microarchitecture for a processor core 300 that implements the processing device including heterogeneous cores in accordance with one embodiment. Specifically, the processor core 300 depicts an in-order architecture core and a register renaming logic, out-of-order issue/execution logic to be included in a processor according to at least one embodiment of the disclosure. In one embodiment, the processor core 300 is an extension or more-detailed version of the processor core 102 of FIG. 1.

In various embodiments, the processor core 300 includes a front end unit 330 coupled to an execution engine unit 350, and both are coupled to a memory unit 370. The processor core 300 may include a reduced instruction set computing (RISC) core, a complex instruction set computing (CISC) core, a very long instruction word (VLIW) core, or a hybrid or alternative core type. As yet another option, the processor core 300 may include a special-purpose core, such as, for example, a network or communication core, compression engine, graphics core, or the like. In one embodiment, the processor core 300 may be a multi-core processor or may be part of a multi-processor system.

In embodiments, the front end unit 330 includes a branch prediction unit 332 coupled to an instruction cache unit 334, which is coupled to an instruction translation lookaside buffer (TLB) 336, which is coupled to an instruction fetch unit 338, which is coupled to a decode unit 340. The decode unit 340 (also known as a decoder) may decode instructions, and generate as an output one or more micro-operations, micro-code entry points, microinstructions, other instructions, or other control signals, which are decoded from, or which otherwise reflect, or are derived from, the original instructions. The decoder 340 may be implemented using various different mechanisms. Examples of suitable mechanisms include, but are not limited to, look-up tables, hardware implementations, programmable logic arrays (PLAs), microcode read only memories (ROMs), and the like. The instruction cache unit 334 is further coupled to the memory unit 370. The decode unit 340 is coupled to a rename/allocator unit 352 in the execution engine unit 350.

In embodiments, the execution engine unit 350 includes the rename/allocator unit 352 coupled to a retirement unit 354, also known as a re-order buffer (ROB), and a set of one or more scheduler unit(s) 356. The scheduler unit(s) 356 represents any number of different schedulers, including reservation stations (RS), central instruction window, and the like. In one embodiment, the rename/allocator unit 352 and the scheduler unit(s) 356 may together perform the function of the unified scheduler 151 of FIG. 1. The scheduler unit(s) 356 may be coupled to the physical register file(s) unit(s) 358. Each of the physical register file(s) units 358 may represent one or more physical register files, different ones of which store one or more different data types, such as scalar integer, scalar floating point, packed integer, packed floating point, vector integer, vector floating point, etc., status (e.g., an instruction pointer that is the address of the next instruction to be executed), etc. The physical register file(s) unit(s) 358 may be overlapped by the retirement unit 354 to illustrate various ways in which register renaming and out-of-order execution may be implemented (e.g., using a re-order buffer and a retirement register file(s); using a future file(s), a history buffer(s), and a retirement register file(s); using register maps and a pool of registers; and the like).

Generally, the architectural registers are visible from the outside of a processor or from a programmer's perspective. The registers are not limited to any known particular type of circuit. Various different types of registers are suitable as long as they are capable of storing and providing data as described herein. Examples of suitable registers include, but are not limited to, dedicated physical registers, dynamically allocated physical registers using register renaming, combinations of dedicated and dynamically allocated physical registers, and the like. The retirement unit 354 and the physical register file(s) unit(s) 358 are coupled to the execution cluster(s) 360. The execution cluster(s) 360 may include a set of one or more execution units 362 and a set of one or more memory access units 364. The execution units 362 may perform various operations (e.g., shifts, addition, subtraction, multiplication) and operate on various types of data (e.g., scalar floating point, packed integer, packed floating point, vector integer, vector floating point).

While some embodiments may include a number of execution units dedicated to specific functions or sets of functions, other embodiments may include only one execution unit or multiple execution units that all perform all functions. The scheduler unit(s) 356, physical register file(s) unit(s) 358, and execution cluster(s) 360 are shown as being possibly plural because certain embodiments create separate pipelines for certain types of data/operations (e.g., a scalar integer pipeline, a scalar floating point/packed integer/packed floating point/vector integer/vector floating point pipeline, and/or a memory access pipeline that each have their own scheduler unit, physical register file(s) unit, and/or execution cluster; in the case of a separate memory access pipeline, certain embodiments are implemented in which only the execution cluster of this pipeline has the memory access unit(s) 364). It should also be understood that where separate pipelines are used, one or more of these pipelines may be out-of-order issue/execution and the rest may be in-order.

The set of memory access units 364 may be coupled to the memory unit 370, which may include a data prefetcher 380, a data TLB unit 372 (e.g., DTLB), a data cache unit (DCU) 374, and a level 2 (L2) cache unit 376, to name a few examples. In some embodiments the DCU 374 is also known as a first level data cache (L1 cache). The DCU 374 may handle multiple outstanding cache misses and continue to service incoming stores and loads. It may also support maintaining cache coherency. The data TLB unit 372 may be a cache used to improve virtual address translation speed by mapping virtual and physical address spaces. In one exemplary embodiment, the memory access units 364 may include a load unit, a store address unit, and a store data unit, each of which is coupled to the data TLB unit 372 in the memory unit 370. The L2 cache unit 376 may be coupled to one or more other levels of cache and eventually to a main memory, e.g., the memory 110 of FIG. 1.

In one embodiment, the data prefetcher 380 speculatively loads/prefetches data to the DCU 374 by automatically predicting which data a program is about to consume. Prefetching may refer to transferring data stored in one memory location of a memory hierarchy (e.g., lower level caches or memory) to a higher-level memory location that is closer (e.g., yields lower access latency) to the processor before the data is actually demanded by the processor. More specifically, prefetching may refer to the early retrieval of data from one of the lower level caches/memory to a data cache and/or prefetch buffer before the processor issues a demand for the specific data being returned.

The processor core 300 may support one or more instruction sets (e.g., the x86 instruction set (with some extensions that have been added with newer versions); the MIPS instruction set of MIPS Technologies of Sunnyvale, Calif.; the ARM instruction set (with optional additional extensions such as NEON) of ARM Holdings of Sunnyvale, Calif.).

It should be understood that the core may support multithreading (executing two or more parallel sets of operations or threads), and may do so in a variety of ways including time sliced multithreading, simultaneous multithreading (where a single physical core provides a logical core for each of the threads that physical core is simultaneously multithreading), or a combination thereof (e.g., time sliced fetching and decoding and simultaneous multithreading thereafter such as in the Intel® Hyperthreading technology).

While register renaming is described in the context of out-of-order execution, it should be understood that register renaming may be used in an in-order architecture as well. While the illustrated embodiment of the processor core 300 also includes separate instruction and data cache units and a shared L2 cache unit, alternative embodiments may have a single internal cache for both instructions and data, such as, for example, a Level 1 (L1) internal cache, or multiple levels of internal cache. In some embodiments, the system may include a combination of an internal cache and an external cache that is external to the core and/or the processor. Alternatively, all of the cache may be external to the core and/or the processor.

FIG. 3B is a block diagram illustrating an in-order pipeline and a register renaming stage, out-of-order issue/execution pipeline implemented by processor core 300 of FIG. 3A according to some embodiments. The solid lined boxes in FIG. 3B illustrate an in-order pipeline, while the dashed lined boxes illustrate a register renaming, out-of-order issue/execution pipeline. In FIG. 3B, a processor core 300 as a pipeline includes a fetch stage 302, a length decode stage 304, a decode stage 306, an allocation stage 308, a renaming stage 310, a scheduling (also known as a dispatch or issue) stage 312, a register read/memory read stage 314, an execute stage 316, a write back/memory write stage 318, an exception handling stage 322, and a commit stage 324. In some embodiments, the ordering of stages 302-324 may be different than illustrated and is not limited to the specific ordering shown in FIG. 3B.

FIG. 4 is a block diagram illustrating system microarchitecture 400 and functionality of longitudinal profiling of a load operation selected for monitoring according to an embodiment. Components of the system microarchitecture 400 of FIG. 4 may carry corresponding numbering to those of the system microarchitecture 100 of FIG. 1, and thus the processor core 102 may be understood to include the system microarchitecture 400 in some embodiments. The system microarchitecture 400 may include an execution engine unit 450 that is to interface with and monitor memory operations passing through a memory subsystem 472.

In one embodiment, the execution engine unit 450 may include a re-order buffer (ROB) 454 and a unified scheduler 451. The ROB 454 may include a linear feedback shift register (LFSR) 428 (or other random number generator) that may generate a random number. If the random number matches the current cycle number, which is the current stage in the hardware pipeline, then the ROB 454 may select the load operation that occurs within the instruction in that cycle (432). This selection may occur once per thread that the processor core 102 is executing (or more often in other embodiments). Once the unified scheduler 451 has scheduled the load operation, the unified scheduler 451 may dispatch the load operation by function of an AND gate 452 within the unified scheduler 451. The execution engine unit 450 may also determine a re-order buffer identifier (ROB ID) of the load operation, and transmit the ROB ID to the memory subsystem 472. It is by reference to the ROB ID that the execution engine unit 450 can monitor progress and completion of the load operation. The ROB 454 may further include an instruction latency counter 436, which may be started when the load operation is dispatched, and thus has begun to be processed.
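The random selection described above can be pictured with a small software model. The following Python sketch is illustrative only and is not the disclosed hardware: the 16-bit width, the tap positions, the seed, and the `compare_bits` parameter are all assumptions chosen for the example, as are the function names `lfsr16_step` and `pick_monitored_load`.

```python
def lfsr16_step(state: int) -> int:
    """Advance a 16-bit Fibonacci LFSR (taps 16, 14, 13, 11) by one step."""
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)


def pick_monitored_load(loads, seed=0xACE1, compare_bits=4):
    """Return (cycle, rob_id) of the first load whose cycle matches the LFSR draw.

    loads is an iterable of (cycle, rob_id) pairs in dispatch order.
    Only the low compare_bits bits of the random number and the cycle
    number are compared (an assumption of this sketch, not of the text).
    Returns None when no load in the sequence is selected.
    """
    mask = (1 << compare_bits) - 1
    state = seed
    for cycle, rob_id in loads:
        state = lfsr16_step(state)  # draw a new pseudo-random number
        if (state & mask) == (cycle & mask):
            return cycle, rob_id  # this ROB ID would be sent to the memory subsystem
    return None


if __name__ == "__main__":
    # Hypothetical stream of loads: one per cycle, ROB ID derived from the cycle.
    loads = [(cycle, 100 + cycle) for cycle in range(64)]
    print(pick_monitored_load(loads))
```

Widening `compare_bits` makes a match less likely and therefore lowers the sampling rate, which is one plausible way to trade monitoring overhead against coverage.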

The memory subsystem 472 may include a number of memory subunits, as in the system microarchitecture 100 of FIG. 1. By way of example, and for purposes of explanation, the memory subsystem 472 may include a memory ordering buffer (MOB) 472A, a data TLB (DTLB) 472B, and a DCU 472C, which may in turn include a first status register 474A, a second status register 474B, and a third status register 474C, respectively. Additional and/or different memory subunits are envisioned.

In various embodiments, the memory subunits may receive or detect certain pieces of information (which may include certain events) that provide a status of the load operation during execution. These pieces of information are identified as first data 476A, second data 476B, and third data 476C, corresponding respectively to the MOB 472A, the DTLB 472B, and the DCU 472C. For example, the first data 476A for the MOB 472A may include detection of an incomplete overlap between the load operation and a store operation on which the load operation is dependent, or detection of an unknown store address. Furthermore, the second data 476B of the DTLB 472B may include one of presence or absence of a DTLB miss or a DTLB hit. Additionally, the third data 476C of the DCU 472C may include cache latency. The DCU 472C may include one or more fill buffer(s) that may be monitored in certain ways (to be discussed below) that impact the cache latency. These pieces of information, including the first data 476A, the second data 476B, and the third data 476C, may be recognized among the data stored to the performance monitoring record such as illustrated in Table 1.

In one embodiment, responsive to receipt of the re-order buffer identifier (ROB ID), the MOB 472A may store the first data 476A, related to a status of the load operation, in the first status register 474A; the DTLB 472B may store the second data 476B in the second status register 474B; and the DCU 472C may store the third data 476C in the third status register 474C. In this way, the pieces of information associated with the status of the load operation for each of the multiple memory subunits are temporarily stored as performance monitoring data in respective status registers.

With continued reference to FIG. 4, the ROB ID of the load operation may be returned from the unified scheduler 451 upon being randomly selected for monitoring, which is also retained within the ROB 454 awaiting retirement, e.g., in advance of detection of the write back of the load operation (438). When the write back to the ROB 454 of the load operation occurs, indicating completion of the load operation (437), the ROB 454 may determine whether the ROB ID of the write back is a match for the monitored load and for which the instruction latency counter 436 was initiated (440). If so, the ROB 454 may stop the instruction latency counter (442). The counter value of the instruction latency counter 436 may now reflect the instruction latency for the monitored load operation.
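As a rough software analogy of the start, match, and stop behavior just described, the Python sketch below models the instruction latency counter. The class and method names are hypothetical and the per-cycle `tick` call stands in for the core clock; the actual counter is hardware within the ROB.

```python
class InstructionLatencyCounter:
    """Toy model of a latency counter keyed to one monitored ROB ID."""

    def __init__(self):
        self.monitored_rob_id = None
        self.running = False
        self.value = 0

    def start(self, rob_id):
        """Begin counting when the monitored load is dispatched."""
        self.monitored_rob_id = rob_id
        self.running = True
        self.value = 0

    def tick(self):
        """Advance one core clock; counts only while the load is in flight."""
        if self.running:
            self.value += 1

    def on_writeback(self, rob_id):
        """Stop counting only if the write back carries the monitored ROB ID."""
        if self.running and rob_id == self.monitored_rob_id:
            self.running = False
            return True
        return False


if __name__ == "__main__":
    counter = InstructionLatencyCounter()
    counter.start(rob_id=5)
    for _ in range(3):
        counter.tick()
    counter.on_writeback(rob_id=9)  # unrelated write back, counter keeps running
    counter.tick()
    counter.on_writeback(rob_id=5)  # matching write back stops the counter
    print(counter.value)
```

Write backs for other ROB IDs are ignored, so the final value reflects only the monitored load.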

In embodiments, the processor core 102 may execute microcode 420 (or other logic) to detect retirement of the load operation (449). The processor core 102 may further execute the microcode 420 (or other logic) to, in response to detection of the retirement of the load operation, store the piece of information (or data) from each of the first status register 474A, second status register 474B, and third status register 474C, as well as the counter value from the instruction latency counter 436, into corresponding fields of a performance monitoring record 416 of a memory buffer 414 (see Table 1). Each field may correspond to a type of data stored in each respective status register, for example. A performance monitoring tool that the processor core executes as software may then retrieve the data from the performance monitoring record 416. In an alternative embodiment, no microcode is executed as the memory subunits and the ROB 454 (and possibly other hardware that buffers such pieces of information) may be configured to directly detect the retirement of the load operation and store the pieces of information into respective fields of the performance monitoring record 416.
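The retirement-time copy from the status registers into one record can be sketched as follows. This Python example is a simplified illustration under stated assumptions: the field names of `PerfRecord`, the dictionary keys, and the function `store_record_on_retirement` are hypothetical and do not reproduce the layout of Table 1.

```python
from dataclasses import dataclass


@dataclass
class PerfRecord:
    """One performance monitoring record; one field per data source."""
    mob_status: int           # from the first status register (MOB)
    dtlb_status: int          # from the second status register (DTLB)
    dcu_latency: int          # from the third status register (DCU)
    instruction_latency: int  # from the instruction latency counter


def store_record_on_retirement(status_registers, latency_value, memory_buffer):
    """Copy each subunit's status data into its own field and append the record."""
    record = PerfRecord(
        mob_status=status_registers["MOB"],
        dtlb_status=status_registers["DTLB"],
        dcu_latency=status_registers["DCU"],
        instruction_latency=latency_value,
    )
    memory_buffer.append(record)
    return record


if __name__ == "__main__":
    memory_buffer = []
    registers = {"MOB": 1 << 23, "DTLB": 1, "DCU": 5}  # made-up sample values
    store_record_on_retirement(registers, latency_value=12, memory_buffer=memory_buffer)
    print(memory_buffer)
```

Keeping one field per subunit lets software attribute a stall to a specific stage of the memory subsystem when it later reads the buffer.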

In various embodiments, there may be two sample precise block events that are of particular focus. Upon detection that the load operation is blocked by a preceding store forward operation with an overlapping linear address (e.g., when a LD_BLOCKS.STORE_FORWARD event fires), the MOB 472A may set a bit of the first status register 474A. In this scenario, the overlapping linear address may prevent the store operation from forwarding the data required by the load operation. In a second precise block event, upon detection that the load operation is blocked by an unknown linear store address (e.g., when a SB_BLOCKS.STORE_ADDR_BLK event fires, where “SB” stands for store buffer), the MOB 472A may set a different bit (e.g., bit 23 that may be called STORE_ADDR_BLK) in the first status register 474A. This second scenario may also arise if memory disambiguation has been disabled, during memory disambiguation training, or when a hardware watchdog is activated. A hardware watchdog is a feature included on many computers, whose purpose is to reboot the computer automatically in case the system hangs. Once the watchdog is activated, it is to receive a ping at regular intervals from the system, and, if the hardware watchdog does not, the hardware watchdog will cause a hardware reset.
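The two block events map naturally onto bit flags in a status register. In the Python sketch below, bit 23 for STORE_ADDR_BLK follows the example in the text, whereas the bit position used for the store-forward block (22) is purely an assumption, since no position is given for it; the helper names are likewise hypothetical.

```python
STORE_ADDR_BLK_BIT = 23  # example bit position given in the text
STORE_FORWARD_BIT = 22   # hypothetical position; the text does not specify one

EVENT_BITS = {
    "LD_BLOCKS.STORE_FORWARD": STORE_FORWARD_BIT,
    "SB_BLOCKS.STORE_ADDR_BLK": STORE_ADDR_BLK_BIT,
}


def record_block_event(status_register: int, event: str) -> int:
    """Return the status register value with the bit for this event set."""
    return status_register | (1 << EVENT_BITS[event])


def blocked_by(status_register: int) -> list:
    """List the block events whose bits are set in the status register."""
    return [name for name, bit in EVENT_BITS.items() if status_register & (1 << bit)]


if __name__ == "__main__":
    reg = 0
    reg = record_block_event(reg, "SB_BLOCKS.STORE_ADDR_BLK")
    print(hex(reg), blocked_by(reg))
```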

The system microarchitecture 400 may further include a cache latency implementation, e.g., specific load-to-use embodiments. A 16-bit (or other value) saturating counter may be present in core clocks for load operations monitored by the present micro-architecture. The duration of a cache-miss interval may be defined, per each of: (i) from fill buffer allocation by monitored load; (ii) from monitored load hit (squashed) into a fill buffer allocated by some earlier request, e.g., of an earlier dispatched instruction; and (iii) from a homeless prefetch issued as a result of monitored load when the fill buffer has no room. The term squashed is to say that the load operation merges into an existing fill buffer.

In the above-described cache latency embodiments, the cache miss latency counter may stop on monitored load writeback in the above-listed cases. The default latency, e.g., with no cache miss, may be five (“5”) clock cycles, the L1 cache hit latency of the processor core 102. If the monitored load completes without allocating/merging into a fill buffer, the counter should reset to a value of five clock cycles. Further, the cache latency on a “hit” of the cache may be five clock cycles and the cache latency on a miss of the cache may be some value greater than five clock cycles. On JEClear, e.g., branch mispredict, the 16-bit saturating counter may be reset if the JEClear is older than the monitored load. For memory renaming, a load check may update the counter value to zero, which may indicate an even shorter latency to software.
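The reporting rules for the cache latency value can be summarized in a short function. The Python sketch below is a simplified reading of the paragraph above rather than the hardware behavior: it assumes the counter value on a miss is simply the number of core clocks in the cache-miss interval, and the function and parameter names are hypothetical. The JEClear reset is not modeled.

```python
COUNTER_MAX = 0xFFFF   # 16-bit saturating counter
L1_HIT_LATENCY = 5     # default latency in clock cycles, per the text


def reported_cache_latency(miss_cycles, used_fill_buffer, memory_renamed=False):
    """Return the cache latency value that would be exposed to software.

    miss_cycles: core clocks counted over the cache-miss interval.
    used_fill_buffer: True if the load allocated or merged into a fill buffer.
    memory_renamed: True if a memory-renaming load check zeroed the counter.
    """
    if memory_renamed:
        return 0  # signals an even shorter latency than an L1 hit
    if not used_fill_buffer:
        return L1_HIT_LATENCY  # no miss: counter resets to the default
    return min(miss_cycles, COUNTER_MAX)  # saturate instead of wrapping


if __name__ == "__main__":
    print(reported_cache_latency(0, used_fill_buffer=False))
    print(reported_cache_latency(230, used_fill_buffer=True))
    print(reported_cache_latency(100_000, used_fill_buffer=True))
```

Saturating at the maximum value, rather than wrapping around, keeps a very long miss from being misread as a short one.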

FIG. 5 is a flow chart of a method 500 for precise longitudinal monitoring of memory load operations according to various embodiments. The method 500 may be performed by processing logic that may include hardware (e.g., circuitry, dedicated logic, programmable logic, microcode, etc.), software (such as instructions run on a processing device, a computer system, or a dedicated machine), firmware, or a combination thereof. In one embodiment, the method 500 may be performed, in part, by the processor core 102 described above with respect to FIG. 1.

For simplicity of explanation, the method 500 is depicted and described as a series of acts. However, acts in accordance with this disclosure can occur in various orders and/or concurrently and with other acts not presented and described herein. Furthermore, not all illustrated acts may be performed to implement the method 500 in accordance with the disclosed subject matter. In addition, those skilled in the art will understand and appreciate that the method 500 could alternatively be represented as a series of interrelated states via a state diagram or events.

Referring to FIG. 5, the method 500 may begin with the processing logic randomly selecting a load operation to monitor (510). The method 500 may continue with the processing logic determining a re-order buffer identifier of the load operation (520). The method 500 may continue with the processing logic transmitting the re-order buffer identifier to the memory subsystem (530).

In some embodiments, the method 500 may continue with the processing logic, in response to dispatch of the load operation for execution, starting to increment an instruction latency counter associated with the re-order buffer identifier (515). The method 500 may continue with the processing logic detecting a write back to the re-order buffer of the load operation from the memory subsystem, designating completion of the load operation (525). The method may continue with the processing logic stopping a counter value of the instruction latency counter in response to the write back to the re-order buffer from the memory subsystem (535).

With continued reference to FIG. 5, the method 500 may continue with the processing logic storing in a status register, by respective memory subunit(s) of the memory subsystem responsive to receipt of the re-order buffer identifier, a piece of information related to a status of the load operation (540). The method 500 may continue with the processing logic detecting retirement of the load operation (550). The method 500 may continue with the processing logic retrieving a data access address (e.g., DLA) of the load operation (555). The method 500 may continue with the processing logic storing, in response to detecting the retirement of the load operation, each piece of information from respective status registers, the data access address, and the counter value into corresponding fields of a record of a memory buffer (560). Each field may be associated with a different memory subunit, as per exemplary record in Table 1. The method 500 may continue with software reading out a series of memory buffer records, to include the record referenced at block 560, as performance monitoring data (570).

FIG. 6 illustrates a block diagram of the microarchitecture for a processor 600 (e.g., processing device 60) that includes hybrid cores in accordance with one embodiment of the disclosure. In some embodiments, an instruction in accordance with one embodiment can be implemented to operate on data elements having sizes of byte, word, doubleword, quadword, etc., as well as datatypes, such as single and double precision integer and floating point datatypes. In one embodiment the in-order front end 601 is the part of the processor 600 that fetches instructions to be executed and prepares them to be used later in the processor pipeline.

The front end 601 may include several units. In one embodiment, the instruction prefetcher 626 fetches instructions from memory and feeds them to an instruction decoder 628 which in turn decodes or interprets them. For example, in one embodiment, the decoder decodes a received instruction into one or more operations called “micro-instructions” or “micro-operations” (also called micro op or μops) that the machine can execute. In other embodiments, the decoder parses the instruction into an opcode and corresponding data and control fields that are used by the microarchitecture to perform operations in accordance with one embodiment. In one embodiment, the trace cache 630 takes decoded μops and assembles them into program ordered sequences or traces in the μop queue 634 for execution. When the trace cache 630 encounters a complex instruction, the microcode ROM 632 provides the μops needed to complete the operation.

Some instructions are converted into a single micro-op, whereas others need several micro-ops to complete the full operation. In one embodiment, if more than four micro-ops are needed to complete an instruction, the decoder 628 accesses the microcode ROM 632 to do the instruction. For one embodiment, an instruction can be decoded into a small number of micro ops for processing at the instruction decoder 628. In another embodiment, an instruction can be stored within the microcode ROM 632 should a number of micro-ops be needed to accomplish the operation. The trace cache 630 refers to an entry point programmable logic array (PLA) to determine a correct micro-instruction pointer for reading the micro-code sequences to complete one or more instructions in accordance with one embodiment from the micro-code ROM 632. After the microcode ROM 632 finishes sequencing micro-ops for an instruction, the front end 601 of the machine resumes fetching micro-ops from the trace cache 630.

The out-of-order execution engine 603 is where the instructions are prepared for execution. The out-of-order execution logic has a number of buffers to smooth out and re-order the flow of instructions to optimize performance as they go down the pipeline and get scheduled for execution. The allocator logic allocates the machine buffers and resources that each μop needs in order to execute. The register renaming logic renames logic registers onto entries in a register file. The allocator also allocates an entry for each μop in one of the two μop queues, one for memory operations and one for non-memory operations, in front of the instruction schedulers: memory scheduler, fast scheduler 602, slow/general floating point scheduler 604, and simple floating point scheduler 606. The μop schedulers 602, 604, 606, determine when a μop is ready to execute based on the readiness of its dependent input register operand sources and the availability of the execution resources the μop needs to complete its operation. The fast scheduler 602 of one embodiment can schedule on each half of the main clock cycle while the other schedulers can only schedule once per main processor clock cycle. The schedulers arbitrate for the dispatch ports to schedule μops for execution.

Register files 608, 610, sit between the schedulers 602, 604, 606, and the execution units 612, 614, 616, 618, 620, 622, 624 in the execution block 611. There is a separate register file 608, 610, for integer and floating point operations, respectively. Each register file 608, 610, of one embodiment also includes a bypass network that can bypass or forward just completed results that have not yet been written into the register file to new dependent μops. The integer register file 608 and the floating point register file 610 are also capable of communicating data with the other. For one embodiment, the integer register file 608 is split into two separate register files, one register file for the low order 32 bits of data and a second register file for the high order 32 bits of data. The floating point register file 610 of one embodiment has 128 bit wide entries because floating point instructions typically have operands from 64 to 128 bits in width.

The execution block 611 contains the execution units 612, 614, 616, 618, 620, 622, 624, where the instructions are actually executed. This section includes the register files 608, 610, that store the integer and floating point data operand values that the micro-instructions need to execute. The processor 600 of one embodiment is comprised of a number of execution units: address generation unit (AGU) 612, AGU 614, fast ALU 616, fast ALU 618, slow ALU 620, floating point ALU 622, floating point move unit 624. For one embodiment, the floating point execution blocks 622, 624, execute floating point, MMX, SIMD, and SSE, or other operations. The floating point ALU 622 of one embodiment includes a 64 bit by 64 bit floating point divider to execute divide, square root, and remainder micro-ops. For embodiments of the present disclosure, instructions involving a floating point value may be handled with the floating point hardware.

In one embodiment, the ALU operations go to the high-speed ALU execution units 616, 618. The fast ALUs 616, 618, of one embodiment can execute fast operations with an effective latency of half a clock cycle. For one embodiment, most complex integer operations go to the slow ALU 620 as the slow ALU 620 includes integer execution hardware for long latency type of operations, such as a multiplier, shifts, flag logic, and branch processing. Memory load/store operations are executed by the AGUs 612, 614. For one embodiment, the integer ALUs 616, 618, 620, are described in the context of performing integer operations on 64 bit data operands. In alternative embodiments, the ALUs 616, 618, 620, can be implemented to support a variety of data bit widths including 16, 32, 128, 256, etc. Similarly, the floating point units 622, 624, can be implemented to support a range of operands having bits of various widths. For one embodiment, the floating point units 622, 624, can operate on 128 bits wide packed data operands in conjunction with SIMD and multimedia instructions.

In one embodiment, the μops schedulers 602, 604, 606, dispatch dependent operations before the parent load has finished executing. As μops are speculatively scheduled and executed in processor 600, the processor 600 also includes logic to handle memory misses. If a data load misses in the data cache, there can be dependent operations in flight in the pipeline that have left the scheduler with temporarily incorrect data. A replay mechanism tracks and re-executes instructions that use incorrect data. Only the dependent operations need to be replayed and the independent ones are allowed to complete. The schedulers and replay mechanism of one embodiment of a processor are also designed to catch instruction sequences for text string comparison operations.

The processor 600 also includes logic to implement store address prediction for memory disambiguation according to embodiments of the disclosure. In one embodiment, the execution block 611 of processor 600 may include a store address predictor (not shown) for implementing store address prediction for memory disambiguation.

The term “registers” may refer to the on-board processor storage locations that are used as part of instructions to identify operands. In other words, registers may be those that are usable from the outside of the processor (from a programmer's perspective). However, the registers of an embodiment should not be limited in meaning to a particular type of circuit. Rather, a register of an embodiment is capable of storing and providing data, and performing the functions described herein. The registers described herein can be implemented by circuitry within a processor using any number of different techniques, such as dedicated physical registers, dynamically allocated physical registers using register renaming, combinations of dedicated and dynamically allocated physical registers, etc. In one embodiment, integer registers store thirty-two bit integer data. A register file of one embodiment also contains eight multimedia SIMD registers for packed data.

For the discussions below, the registers are understood to be data registers designed to hold packed data, such as 64 bits wide MMX™ registers (also referred to as ‘mm’ registers in some instances) in microprocessors enabled with MMX technology from Intel Corporation of Santa Clara, Calif. These MMX registers, available in both integer and floating point forms, can operate with packed data elements that accompany SIMD and SSE instructions. Similarly, 128 bits wide XMM registers relating to SSE2, SSE3, SSE4, or beyond (referred to generically as “SSEx”) technology can also be used to hold such packed data operands. In one embodiment, in storing packed data and integer data, the registers do not need to differentiate between the two data types. In one embodiment, integer and floating point are either contained in the same register file or different register files. Furthermore, in one embodiment, floating point and integer data may be stored in different registers or the same registers.

Referring now to FIG. 7, shown is a block diagram illustrating a system 700 in which an embodiment of the disclosure may be used. As shown in FIG. 7, multiprocessor system 700 is a point-to-point interconnect system, and includes a first processor 770 and a second processor 780 coupled via a point-to-point interconnect 750. While shown with only two processors 770, 780, it is to be understood that the scope of embodiments of the disclosure is not so limited. In other embodiments, one or more additional processors may be present in a given system. In one embodiment, the multiprocessor system 700 may implement hybrid cores as described herein.

Processors 770 and 780 are shown including integrated memory controller units 772 and 782, respectively. Processor 770 also includes as part of its bus controller units point-to-point (P-P) interfaces 776 and 778; similarly, second processor 780 includes P-P interfaces 786 and 788. Processors 770, 780 may exchange information via a point-to-point (P-P) interface 750 using P-P interface circuits 778, 788. As shown in FIG. 7, IMCs 772 and 782 couple the processors to respective memories, namely a memory 732 and a memory 734, which may be portions of main memory locally attached to the respective processors.

Processors 770, 780 may each exchange information with a chipset 790 via individual P-P interfaces 752, 754 using point to point interface circuits 776, 794, 786, 798. Chipset 790 may also exchange information with a high-performance graphics circuit 738 via a high-performance graphics interface 739.

A shared cache (not shown) may be included in either processor or outside of both processors, yet connected with the processors via P-P interconnect, such that either or both processors' local cache information may be stored in the shared cache if a processor is placed into a low power mode.

Chipset 790 may be coupled to a first bus 716 via an interface 796. In one embodiment, first bus 716 may be a Peripheral Component Interconnect (PCI) bus, or a bus such as a PCI Express bus or another third generation I/O interconnect bus, although the scope of the present disclosure is not so limited.

As illustrated in FIG. 7, various I/O devices 714 may be coupled to first bus 716, along with a bus bridge 718 which couples first bus 716 to a second bus 720. In one embodiment, second bus 720 may be a low pin count (LPC) bus. Various devices may be coupled to second bus 720 including, for example, a keyboard and/or mouse 722, communication devices 727 and a storage unit 728 such as a disk drive or other mass storage device which may include instructions/code and data 730, in one embodiment. Further, an audio I/O 724 may be coupled to second bus 720. Note that other architectures are possible. For example, instead of the point-to-point architecture of FIG. 7, a system may implement a multi-drop bus or other such architecture.

Referring now to FIG. 8, shown is a block diagram of a system 800 in which one embodiment of the disclosure may operate. The system 800 may include one or more processors 810, 815, which are coupled to graphics memory controller hub (GMCH) 820. The optional nature of additional processors 815 is denoted in FIG. 8 with broken lines. In one embodiment, processors 810, 815 implement hybrid cores according to embodiments of the disclosure.

Each processor 810, 815 may be some version of the circuit, integrated circuit, processor, and/or silicon integrated circuit as described above. However, it should be noted that it is unlikely that integrated graphics logic and integrated memory control units would exist in the processors 810, 815. FIG. 8 illustrates that the GMCH 820 may be coupled to a memory 840 that may be, for example, a dynamic random access memory (DRAM). The DRAM may, for at least one embodiment, be associated with a non-volatile cache.

The GMCH 820 may be a chipset, or a portion of a chipset. The GMCH 820 may communicate with the processor(s) 810, 815 and control interaction between the processor(s) 810, 815 and memory 840. The GMCH 820 may also act as an accelerated bus interface between the processor(s) 810, 815 and other elements of the system 800. For at least one embodiment, the GMCH 820 communicates with the processor(s) 810, 815 via a multi-drop bus, such as a frontside bus (FSB) 895.

Furthermore, GMCH 820 is coupled to a display 845 (such as a flat panel or touchscreen display). GMCH 820 may include an integrated graphics accelerator. GMCH 820 is further coupled to an input/output (I/O) controller hub (ICH) 850, which may be used to couple various peripheral devices to system 800. Shown for example in the embodiment of FIG. 8 is an external graphics device 860, which may be a discrete graphics device, coupled to ICH 850, along with another peripheral device 870.

Alternatively, additional or different processors may also be present in the system 800. For example, additional processor(s) 815 may include additional processor(s) that are the same as processor 810, additional processor(s) that are heterogeneous or asymmetric to processor 810, accelerators (such as, e.g., graphics accelerators or digital signal processing (DSP) units), field programmable gate arrays, or any other processor. There can be a variety of differences between the processor(s) 810, 815 in terms of a spectrum of metrics of merit including architectural, micro-architectural, thermal, power consumption characteristics, and the like. These differences may effectively manifest themselves as asymmetry and heterogeneity amongst the processors 810, 815. For at least one embodiment, the various processors 810, 815 may reside in the same die package.

Referring now to FIG. 9, shown is a block diagram of a system 900 in which an embodiment of the disclosure may operate. FIG. 9 illustrates processors 970, 980. In one embodiment, processors 970, 980 may implement hybrid cores as described above. Processors 970, 980 may include integrated memory and I/O control logic (“CL”) 972 and 982, respectively, and intercommunicate with each other via point-to-point interconnect 950 between point-to-point (P-P) interfaces 978 and 988, respectively. Processors 970, 980 each communicate with chipset 990 via point-to-point interconnects 952 and 954 through the respective P-P interfaces 976 to 994 and 986 to 998 as shown. For at least one embodiment, the CL 972, 982 may include integrated memory controller units. CLs 972, 982 may include I/O control logic. As depicted, memories 932, 934 are coupled to CLs 972, 982, and I/O devices 914 are also coupled to the control logic 972, 982. Legacy I/O devices 915 are coupled to the chipset 990 via interface 996.

Embodiments may be implemented in many different system types. FIG. 10 is a block diagram of a SoC 1000 in accordance with an embodiment of the present disclosure. Dashed lined boxes are optional features on more advanced SoCs. In FIG. 10, an interconnect unit(s) 1012 is coupled to: an application processor 1020 which includes a set of one or more cores 1002A-N and shared cache unit(s) 1006; a system agent unit 1010; a bus controller unit(s) 1016; an integrated memory controller unit(s) 1014; a set of one or more media processors 1018 which may include integrated graphics logic 1008, an image processor 1024 for providing still and/or video camera functionality, an audio processor 1026 for providing hardware audio acceleration, and a video processor 1028 for providing video encode/decode acceleration; a static random access memory (SRAM) unit 1030; a direct memory access (DMA) unit 1032; and a display unit 1040 for coupling to one or more external displays. In one embodiment, a memory module may be included in the integrated memory controller unit(s) 1014. In another embodiment, the memory module may be included in one or more other components of the SoC 1000 that may be used to access and/or control a memory. The application processor 1020 may include a store address predictor for implementing hybrid cores as described in embodiments herein.

The memory hierarchy includes one or more levels of cache within the cores, a set of one or more shared cache units 1006, and external memory (not shown) coupled to the set of integrated memory controller units 1014. The set of shared cache units 1006 may include one or more mid-level caches, such as level 2 (L2), level 3 (L3), level 4 (L4), or other levels of cache, a last level cache (LLC), and/or combinations thereof.

In some embodiments, one or more of the cores 1002A-N are capable of multi-threading. The system agent 1010 includes those components coordinating and operating cores 1002A-N. The system agent unit 1010 may include for example a power control unit (PCU) and a display unit. The PCU may be or include logic and components needed for regulating the power state of the cores 1002A-N and the integrated graphics logic 1008. The display unit is for driving one or more externally connected displays.

The cores 1002A-N may be homogenous or heterogeneous in terms of architecture and/or instruction set. For example, some of the cores 1002A-N may be in order while others are out-of-order. As another example, two or more of the cores 1002A-N may be capable of executing the same instruction set, while others may be capable of executing only a subset of that instruction set or a different instruction set.

The application processor 1020 may be a general-purpose processor, such as a Core™ i3, i5, i7, 2 Duo and Quad, Xeon™, Itanium™, Atom™ or Quark™ processor, which are available from Intel™ Corporation, of Santa Clara, Calif. Alternatively, the application processor 1020 may be from another company, such as ARM Holdings™, Ltd, MIPS™, etc. The application processor 1020 may be a special-purpose processor, such as, for example, a network or communication processor, compression engine, graphics processor, co-processor, embedded processor, or the like. The application processor 1020 may be implemented on one or more chips. The application processor 1020 may be a part of and/or may be implemented on one or more substrates using any of a number of process technologies, such as, for example, BiCMOS, CMOS, or NMOS.

FIG. 11 is a block diagram of an embodiment of a system on-chip (SoC) design in accordance with the present disclosure. As a specific illustrative example, SoC 1100 is included in user equipment (UE). In one embodiment, UE refers to any device to be used by an end-user to communicate, such as a hand-held phone, smartphone, tablet, ultra-thin notebook, notebook with broadband adapter, or any other similar communication device. Often a UE connects to a base station or node, which potentially corresponds in nature to a mobile station (MS) in a GSM network.

Here, SoC 1100 includes 2 cores, 1106 and 1107. Cores 1106 and 1107 may conform to an Instruction Set Architecture, such as an Intel® Architecture Core™-based processor, an Advanced Micro Devices, Inc. (AMD) processor, a MIPS-based processor, an ARM-based processor design, or a customer thereof, as well as their licensees or adopters. Cores 1106 and 1107 are coupled to cache control 1108 that is associated with bus interface unit 1109 and L2 cache 1110 to communicate with other parts of system 1100. Interconnect 1111 includes an on-chip interconnect, such as an IOSF, AMBA, or other interconnect discussed above, which potentially implements one or more aspects of the described disclosure. In one embodiment, cores 1106, 1107 may implement hybrid cores as described in embodiments herein.

Interconnect 1111 provides communication channels to the other components, such as a Subscriber Identity Module (SIM) 1130 to interface with a SIM card, a boot ROM 1135 to hold boot code for execution by cores 1106 and 1107 to initialize and boot SoC 1100, an SDRAM controller 1140 to interface with external memory (e.g. DRAM 1160), a flash controller 1145 to interface with non-volatile memory (e.g. Flash 1165), a peripheral control 1150 (e.g. Serial Peripheral Interface) to interface with peripherals, video codecs 1120 and Video interface 1125 to display and receive input (e.g. touch enabled input), GPU 1115 to perform graphics related computations, etc. Any of these interfaces may incorporate aspects of the disclosure described herein. In addition, the system 1100 illustrates peripherals for communication, such as a Bluetooth module 1170, 3G modem 1175, GPS 1180, and Wi-Fi 1185.

FIG. 12 illustrates a diagrammatic representation of a machine in the example form of a computer system 1200 within which a set of instructions, for causing the machine to perform any one or more of the methodologies discussed herein, may be executed. In alternative embodiments, the machine may be connected (e.g., networked) to other machines in a LAN, an intranet, an extranet, or the Internet. The machine may operate in the capacity of a server or a client device in a client-server network environment, or as a peer machine in a peer-to-peer (or distributed) network environment. The machine may be a personal computer (PC), a tablet PC, a set-top box (STB), a Personal Digital Assistant (PDA), a cellular telephone, a web appliance, a server, a network router, switch or bridge, or any machine capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken by that machine. Further, while only a single machine is illustrated, the term “machine” shall also be taken to include any collection of machines that individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methodologies discussed herein.

The computer system 1200 includes a processing device 1202, a main memory 1204 (e.g., read-only memory (ROM), flash memory, dynamic random access memory (DRAM) (such as synchronous DRAM (SDRAM) or Rambus DRAM (RDRAM)), etc.), a static memory 1206 (e.g., flash memory, static random access memory (SRAM), etc.), and a data storage device 1216, which communicate with each other via a bus 1230.

Processing device 1202 represents one or more general-purpose processing devices such as a microprocessor, central processing unit, or the like. More particularly, the processing device may be a complex instruction set computing (CISC) microprocessor, a reduced instruction set computer (RISC) microprocessor, a very long instruction word (VLIW) microprocessor, a processor implementing other instruction sets, or processors implementing a combination of instruction sets. Processing device 1202 may also be one or more special-purpose processing devices such as an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), a digital signal processor (DSP), network processor, or the like. In one embodiment, processing device 1202 may include one or more processing cores. The processing device 1202 is configured to execute the processing logic 1226 for performing the operations and steps discussed herein. For example, processing logic 1226 may perform operations as described in FIGS. 4-5.

The computer system 1200 may further include a network interface device 1208 communicably coupled to a network 1220. The computer system 1200 also may include a video display unit 1210 (e.g., a liquid crystal display (LCD) or a cathode ray tube (CRT)), an alphanumeric input device 1212 (e.g., a keyboard), a cursor control device 1214 (e.g., a mouse), and a signal generation device 1220 (e.g., a speaker). Furthermore, computer system 1200 may include a graphics processing unit 1222, a video processing unit 1228, and an audio processing unit 1232.

The data storage device 1216 may include a machine-accessible storage medium 1224 on which is stored software 1226 implementing any one or more of the methodologies or functions described herein, such as implementing store address prediction for memory disambiguation as described above. The software 1226 may also reside, completely or at least partially, within the main memory 1204 as instructions 1226 and/or within the processing device 1202 as processing logic 1226 during execution thereof by the computer system 1200; the main memory 1204 and the processing device 1202 also constituting machine-accessible storage media.

The machine-readable storage medium 1224 may also be used to store instructions 1226 implementing store address prediction for hybrid cores such as described according to embodiments of the disclosure. While the machine-accessible storage medium 1224 is shown in an example embodiment to be a single medium, the term “machine-accessible storage medium” should be taken to include a single medium or multiple media (e.g., a centralized or distributed database, and/or associated caches and servers) that store the one or more sets of instructions. The term “machine-accessible storage medium” shall also be taken to include any medium that is capable of storing, encoding or carrying a set of instructions for execution by the machine and that cause the machine to perform any one or more of the methodologies of the present disclosure. The term “machine-accessible storage medium” shall accordingly be taken to include, but not be limited to, solid-state memories, and optical and magnetic media.

The following examples pertain to further embodiments.

Example 1 is a processing device comprising: 1) a memory subsystem comprising a first memory subunit that includes a status register; 2) an execution engine unit coupled to the memory subsystem, the execution engine unit to: a) randomly select a load operation to monitor; b) determine a re-order buffer identifier of the load operation; and c) transmit the re-order buffer identifier to the memory subsystem; and wherein, d) responsive to receipt of the re-order buffer identifier, the first memory subunit is to store a piece of information, related to a status of the load operation, in the status register, and e) responsive to detection of retirement of the load operation, store the piece of information from the status register into a particular field of a record of a memory buffer, wherein the particular field is associated with the first memory subunit.
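As a purely illustrative aid, the flow of Example 1 may be sketched in software as follows. This is a minimal model under stated assumptions, not the disclosed hardware: the class name `MemorySubunit`, its methods, the field name `dtlb_miss`, and the use of a dictionary as the record are all hypothetical.

```python
from typing import Optional


class MemorySubunit:
    """Hypothetical software model of one memory subunit with a status register."""

    def __init__(self, field_name: str) -> None:
        self.field_name = field_name              # record field tied to this subunit
        self.monitored_rob_id: Optional[int] = None
        self.status_register: Optional[int] = None

    def receive_rob_id(self, rob_id: int) -> None:
        # The execution engine unit announces which load is being monitored.
        self.monitored_rob_id = rob_id
        self.status_register = None

    def observe(self, rob_id: int, status: int) -> None:
        # Latch status only for the monitored load; other loads are ignored.
        if rob_id == self.monitored_rob_id:
            self.status_register = status

    def on_retire(self, rob_id: int, record: dict) -> None:
        # At retirement, copy the latched status into this subunit's field.
        if rob_id == self.monitored_rob_id:
            record[self.field_name] = self.status_register
            self.monitored_rob_id = None


dtlb = MemorySubunit("dtlb_miss")
record = {}
dtlb.receive_rob_id(42)
dtlb.observe(17, 1)   # unmonitored load: ignored
dtlb.observe(42, 1)   # monitored load: status latched
dtlb.on_retire(42, record)
print(record)         # {'dtlb_miss': 1}
```

In this sketch the re-order buffer identifier serves only as a tag that lets the subunit tell the monitored load apart from other in-flight loads.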

In Example 2, the processing device of Example 1, wherein the execution engine unit comprises 1) a re-order buffer that generates the re-order buffer identifier, the re-order buffer comprising: 2) a linear feedback shift register to generate a random number that is to select the load operation; and 3) an instruction latency counter to: a) start incrementing a counter value responsive to a dispatch of the load operation; and b) stop the counter value responsive to a write back of the load operation, from the memory subsystem, to the re-order buffer; and wherein, c) in response to detection of the retirement of the load operation, the first memory subunit is further to store the counter value in an access latency field of the record, which is accessible by software.
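The two mechanisms named in Example 2 may be illustrated, again only as a sketch, by a 16-bit Fibonacci linear feedback shift register and a start/stop cycle counter. The register width, the tap positions (16, 14, 13, 11, a commonly used maximal-length choice), the seed, and the modulo-based selection in `select_load` are assumptions made for illustration; the disclosure does not specify them.

```python
def lfsr16_step(state: int) -> int:
    """Advance a 16-bit Fibonacci LFSR with taps 16, 14, 13, 11 by one step."""
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)


def select_load(state: int, num_candidates: int):
    """Hypothetical selection: step the LFSR and pick a candidate index.

    The modulo introduces a slight bias unless num_candidates is a power of two.
    """
    state = lfsr16_step(state)
    return state, state % num_candidates


class InstructionLatencyCounter:
    """Counts cycles between dispatch and write back of the monitored load."""

    def __init__(self) -> None:
        self.value = 0
        self.running = False

    def on_dispatch(self) -> None:
        self.value = 0
        self.running = True

    def tick(self) -> None:
        # Called once per clock cycle; counts only while running.
        if self.running:
            self.value += 1

    def on_write_back(self) -> None:
        self.running = False


state, index = select_load(0xACE1, 8)   # nonzero seed; zero would lock up the LFSR
counter = InstructionLatencyCounter()
counter.on_dispatch()
for _ in range(5):
    counter.tick()
counter.on_write_back()
for _ in range(3):
    counter.tick()                       # ignored after write back
print(counter.value)                     # 5
```

The counter value held after write back is what would be copied into the access latency field of the record once retirement is detected.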

In Example 3, the processor of Example 2, wherein the execution engine unit further comprises 1) a unified scheduler to: a) dispatch the load operation for execution in response to the random selection of the load operation; and b) forward the re-order buffer identifier to the re-order buffer in advance of detection of the write back, to signal to the re-order buffer that the load operation is being monitored.

In Example 4, the processor of Example 1, wherein the first memory subunit is a memory ordering buffer and the piece of information is an unknown store address.

In Example 5, the processor of Example 1, wherein the first memory subunit is further to write a data access address of the load operation into the record of the memory buffer, wherein the first memory subunit is a memory ordering buffer and the piece of information is whether the load operation is blocked due to an address collision with an earlier store operation.

In Example 6, the processor of Example 1, wherein the first memory subunit is a data translation lookaside buffer and the piece of information is one of presence or absence of a miss of the data translation lookaside buffer.

In Example 7, the processor of Example 1, wherein the first memory subunit is a data cache unit and the piece of information is a cache latency value of clock cycles for cache access.

In Example 8, the processor of Example 1, wherein the first memory subunit is further to write a value for an instruction pointer associated with the load operation into the record of the memory buffer.

Various implementations may have different combinations of the structural features described above. For instance, all optional features of the processors and methods described above may also be implemented with respect to a system described herein and specifics in the examples may be used anywhere in one or more implementations.

Example 9 is a system comprising: 1) a memory from which to retrieve data to complete load operations; 2) a core to execute microcode and software, the core comprising: 3) a memory subsystem coupled to the memory, wherein the memory subsystem comprises a plurality of memory subunits, each containing a status register; and 4) an execution engine unit coupled to the memory subsystem and to the core, the execution engine unit to: a) randomly select a load operation to monitor from the load operations, the load operation associated with a thread currently executed by the core; b) determine a re-order buffer identifier of the load operation; and c) transmit the re-order buffer identifier to the memory subsystem; wherein, d) responsive to receipt of the re-order buffer identifier, each of the plurality of memory subunits is to store a piece of information, related to a status of the load operation, in the status register corresponding to the respective memory subunit; and e) wherein the core is to: f) detect retirement of the load operation; and g) in response to detection of the retirement of the load operation, store the piece of information from each status register into a corresponding field of a record of a memory buffer, wherein the record is accessible by the software as performance monitoring data.

In Example 10, the system of Example 9, wherein the execution engine unit comprises 1) a re-order buffer that generates the re-order buffer identifier, the re-order buffer comprising: 2) a linear feedback shift register to generate a random number that is to select the load operation; and 3) an instruction latency counter to: a) start incrementing a counter value responsive to a dispatch of the load operation; and b) stop the counter value responsive to a write back of the load operation, from the memory subsystem, to the re-order buffer; and wherein, c) in response to detection of the retirement of the load operation, the core is further to store the counter value in an access latency field of the record, which is accessible by the software.

In Example 11, the system of Example 10, wherein the execution engine unit further comprises 1) a unified scheduler to: a) dispatch the load operation in response to the random selection of the load operation; and b) forward the re-order buffer identifier to the re-order buffer in advance of detection of the write back, to signal to the re-order buffer that the write back is for the load operation that is being monitored by the instruction latency counter.

In Example 12, the system of Example 9, wherein the plurality of memory subunits comprises a memory ordering buffer, and in response to detecting the load operation is blocked by a preceding store forward operation with an overlapping linear address, the memory ordering buffer is to set a bit of the status register of the memory ordering buffer.

In Example 13, the system of Example 9, wherein the plurality of memory subunits comprises a memory ordering buffer, and in response to detecting the load operation is blocked by an unknown linear store address, the memory ordering buffer is to set a bit of the status register of the memory ordering buffer.
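The status bits of Examples 12 and 13 may be pictured with the following sketch. The bit positions, the names `MobStatus`, `ranges_overlap`, and `mob_status`, and the representation of an unresolved store address as `None` are hypothetical. The sketch also simplifies by treating any address overlap with an older store as a store forward block, whereas actual hardware may block only in particular overlap cases.

```python
from enum import IntFlag
from typing import List, Optional, Tuple


class MobStatus(IntFlag):
    """Hypothetical bit layout for a memory ordering buffer status register."""
    STORE_FORWARD_BLOCK = 0x1
    UNKNOWN_STORE_ADDRESS = 0x2


def ranges_overlap(a_addr: int, a_size: int, b_addr: int, b_size: int) -> bool:
    # Two byte ranges overlap when each starts before the other ends.
    return a_addr < b_addr + b_size and b_addr < a_addr + a_size


def mob_status(load_addr: int, load_size: int,
               older_stores: List[Tuple[Optional[int], int]]) -> MobStatus:
    """Set one bit per blocking condition found among the older stores.

    Each store is (linear address or None if not yet known, size in bytes).
    """
    status = MobStatus(0)
    for store_addr, store_size in older_stores:
        if store_addr is None:
            status |= MobStatus.UNKNOWN_STORE_ADDRESS
        elif ranges_overlap(load_addr, load_size, store_addr, store_size):
            status |= MobStatus.STORE_FORWARD_BLOCK
    return status


blocked = mob_status(0x1000, 8, [(0x1004, 4), (None, 4)])
print(int(blocked))   # 3: both bits set
```

Keeping one bit per condition lets a single status register report both blocking causes for the same monitored load.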

In Example 14, the system of Example 9, wherein the plurality of memory subunits comprises a data translation lookaside buffer for which the piece of information is one of a hit or a miss of the data translation lookaside buffer.

In Example 15, the system of Example 9, wherein the plurality of memory subunits comprises a data cache unit for which the piece of information is a cache latency value of clock cycles for cache access.

Various implementations may have different combinations of the structural features described above. For instance, all optional features of the processors and methods described above may also be implemented with respect to a system described herein and specifics in the examples may be used anywhere in one or more implementations.

Example 16 is a method comprising: a) randomly selecting, by an execution engine unit coupled to a memory subsystem, a load operation to monitor; b) determining, by the execution engine unit, a re-order buffer identifier of the load operation; c) transmitting, by the execution engine unit, the re-order buffer identifier to the memory subsystem; d) storing in a status register, by a first memory subunit of the memory subsystem responsive to receipt of the re-order buffer identifier, a piece of information related to a status of the load operation; e) detecting, by a processor that includes the execution engine unit, retirement of the load operation; and f) storing, by the processor in response to detecting the retirement of the load operation, the piece of information from the status register into a particular field of a record of a memory buffer, wherein the particular field is associated with the first memory subunit.

In Example 17, the method of Example 16, further comprising: a) starting, by the execution engine unit, to increment a counter value of an instruction latency counter responsive to a dispatch of the load operation; b) stopping, by the execution engine unit, the counter value responsive to a write back of the load operation, from the memory subsystem, to the re-order buffer; and c) storing, by the processor in response to the detecting the retirement of the load operation, the counter value in an access latency field of the record, which is accessible by software.

In Example 18, the method of Example 17, further comprising: a) dispatching, by a unified scheduler of the execution engine unit, the load operation in response to the random selection of the load operation; and b) forwarding, by the unified scheduler, the re-order buffer identifier to the re-order buffer in advance of detection of the write back, to signal to the re-order buffer that the write back is for the load operation that is being monitored by the instruction latency counter.

In Example 19, the method of Example 16, further comprising writing, by the processor, a value for an instruction pointer associated with the load operation into the record of the memory buffer.

In Example 20, the method of Example 16, wherein the first memory subunit is a memory ordering buffer and the piece of information is whether the load operation is blocked due to an address collision with an earlier store operation, the method further comprising writing, by the processor, a data access address of the load operation into the record of the memory buffer.

In Example 21, the method of Example 16, wherein the first memory subunit comprises a memory ordering buffer, and in response to detecting the load operation is blocked by a preceding store forward operation with an overlapping linear address, setting, by the memory ordering buffer, a bit of the status register of the memory ordering buffer.

In Example 22, the method of Example 16, wherein the first memory subunit comprises a memory ordering buffer, and in response to detecting the load operation is blocked by an unknown linear store address, setting, by the memory ordering buffer, a bit of the status register of the memory ordering buffer.

Various implementations may have different combinations of the structural features described above. For instance, all optional features of the processors and methods described above may also be implemented with respect to a system described herein and specifics in the examples may be used anywhere in one or more implementations.

Example 23 is a non-transitory computer-readable storage medium storing instructions that, when executed by a processing device, cause the processing device to perform a plurality of operations comprising: a) randomly selecting, by an execution engine unit coupled to a memory subsystem, a load operation to monitor; b) determining, by the execution engine unit, a re-order buffer identifier of the load operation; c) transmitting, by the execution engine unit, the re-order buffer identifier to the memory subsystem; d) storing in a status register, by a first memory subunit of the memory subsystem responsive to receipt of the re-order buffer identifier, a piece of information related to a status of the load operation; e) detecting, by a processor that includes the execution engine unit, retirement of the load operation; and f) storing, by the processor in response to detecting the retirement of the load operation, the piece of information from the status register into a particular field of a record of a memory buffer, wherein the particular field is associated with the first memory subunit.

In Example 24, the non-transitory computer-readable storage medium of Example 23, the operations further comprising: a) starting, by the execution engine unit, to increment a counter value of an instruction latency counter responsive to a dispatch of the load operation; b) stopping, by the execution engine unit, the counter value responsive to a write back of the load operation, from the memory subsystem, to the re-order buffer; and c) storing, by the processor in response to the detecting the retirement of the load operation, the counter value in an access latency field of the record, which is accessible by software.

In Example 25, the non-transitory computer-readable storage medium of Example 24, the operations further comprising: a) dispatching, by a unified scheduler of the execution engine unit, the load operation in response to the random selection of the load operation; and b) forwarding, by the unified scheduler, the re-order buffer identifier to the re-order buffer in advance of detection of the write back, to signal to the re-order buffer that the write back is for the load operation that is being monitored by the instruction latency counter.

In Example 26, the non-transitory computer-readable storage medium of Example 23, the operations further comprising writing, by the processor, a value for an instruction pointer associated with the load operation into the record of the memory buffer.

In Example 27, the non-transitory computer-readable storage medium of Example 23, wherein the first memory subunit is a memory ordering buffer and the piece of information is whether the load operation is blocked due to an address collision with an earlier store operation, the operations further comprising writing, by the processor, a data access address of the load operation into the record of the memory buffer.

In Example 28, the non-transitory computer-readable storage medium of Example 23, wherein the first memory subunit comprises a memory ordering buffer, and in response to detecting the load operation is blocked by a preceding store forward operation with an overlapping linear address, the operations further comprising setting, by the memory ordering buffer, a bit of the status register of the memory ordering buffer.

In Example 29, the non-transitory computer-readable storage medium of Example 23, wherein the first memory subunit comprises a memory ordering buffer, and in response to detecting the load operation is blocked by an unknown linear store address, the operations further comprising setting, by the memory ordering buffer, a bit of the status register of the memory ordering buffer.

Various implementations may have different combinations of the structural features described above. For instance, all optional features of the processors and methods described above may also be implemented with respect to a system described herein and specifics in the examples may be used anywhere in one or more implementations.

Example 30 is a system comprising: a) means for randomly selecting a load operation to monitor; b) means for determining a re-order buffer identifier of the load operation; c) means for transmitting the re-order buffer identifier to a memory subsystem; d) means for storing, by a first memory subunit of the memory subsystem responsive to receipt of the re-order buffer identifier, a piece of information related to a status of the load operation; e) means for detecting retirement of the load operation; and f) means for storing, in response to detecting the retirement of the load operation, the piece of information into a particular field of a record of a memory buffer, wherein the particular field is associated with the first memory subunit.

In Example 31, the system of Example 30, further comprising: a) means for starting to increment a counter value of an instruction latency counter responsive to a dispatch of the load operation; b) means for stopping the counter value responsive to a write back of the load operation, from the memory subsystem, to the re-order buffer; and c) means for storing, in response to the detecting the retirement of the load operation, the counter value in an access latency field of the record, which is accessible by software.

In Example 32, the system of Example 31, further comprising: a) means for dispatching the load operation in response to the random selection of the load operation; and b) means for forwarding the re-order buffer identifier to the re-order buffer in advance of detection of the write back, to signal to the re-order buffer that the write back is for the load operation that is being monitored by the instruction latency counter.

In Example 33, the system of Example 30, further comprising means for writing a value for an instruction pointer associated with the load operation into the record of the memory buffer.

In Example 34, the system of Example 30, wherein the first memory subunit is a memory ordering buffer and the piece of information is whether the load operation is blocked due to an address collision with an earlier store operation, the system further comprising means for writing a data access address of the load operation into the record of the memory buffer.

In Example 35, the system of Example 30, wherein the first memory subunit comprises a memory ordering buffer, and in response to detecting the load operation is blocked by a preceding store forward operation with an overlapping linear address, means for setting, by the memory ordering buffer, a bit of the status register of the memory ordering buffer.

In Example 36, the system of Example 30, wherein the first memory subunit comprises a memory ordering buffer, and in response to detecting the load operation is blocked by an unknown linear store address, means for setting, by the memory ordering buffer, a bit of the status register of the memory ordering buffer.

A design may go through various stages, from creation to simulation to fabrication. Data representing a design may represent the design in a number of manners. First, as is useful in simulations, the hardware may be represented using a hardware description language or another functional description language. Additionally, a circuit level model with logic and/or transistor gates may be produced at some stages of the design process. Furthermore, most designs, at some stage, reach a level of data representing the physical placement of various devices in the hardware model. In the case where conventional semiconductor fabrication techniques are used, the data representing the hardware model may be the data specifying the presence or absence of various features on different mask layers for masks used to produce the integrated circuit. In any representation of the design, the data may be stored in any form of a machine readable medium. A memory or a magnetic or optical storage such as a disc may be the machine readable medium to store information transmitted via optical or electrical wave modulated or otherwise generated to transmit such information. When an electrical carrier wave indicating or carrying the code or design is transmitted, to the extent that copying, buffering, or re-transmission of the electrical signal is performed, a new copy is made. Thus, a communication provider or a network provider may store on a tangible, machine-readable medium, at least temporarily, an article, such as information encoded into a carrier wave, embodying techniques of embodiments of the present disclosure.

A module as used herein refers to any combination of hardware, software, and/or firmware. As an example, a module includes hardware, such as a micro-controller, associated with a non-transitory medium to store code adapted to be executed by the micro-controller. Therefore, reference to a module, in one embodiment, refers to the hardware, which is specifically configured to recognize and/or execute the code to be held on a non-transitory medium. Furthermore, in another embodiment, use of a module refers to the non-transitory medium including the code, which is specifically adapted to be executed by the micro-controller to perform predetermined operations. And as can be inferred, in yet another embodiment, the term module (in this example) may refer to the combination of the micro-controller and the non-transitory medium. Often module boundaries that are illustrated as separate commonly vary and potentially overlap. For example, a first and a second module may share hardware, software, firmware, or a combination thereof, while potentially retaining some independent hardware, software, or firmware. In one embodiment, use of the term logic includes hardware, such as transistors, registers, or other hardware, such as programmable logic devices.

Use of the phrase ‘configured to,’ in one embodiment, refers to arranging, putting together, manufacturing, offering to sell, importing and/or designing an apparatus, hardware, logic, or element to perform a designated or determined task. In this example, an apparatus or element thereof that is not operating is still ‘configured to’ perform a designated task if it is designed, coupled, and/or interconnected to perform said designated task. As a purely illustrative example, a logic gate may provide a 0 or a 1 during operation. But a logic gate ‘configured to’ provide an enable signal to a clock does not include every potential logic gate that may provide a 1 or 0. Instead, the logic gate is one coupled in some manner that during operation the 1 or 0 output is to enable the clock. Note once again that use of the term ‘configured to’ does not require operation, but instead focuses on the latent state of an apparatus, hardware, and/or element, where in the latent state the apparatus, hardware, and/or element is designed to perform a particular task when the apparatus, hardware, and/or element is operating.

Furthermore, use of the phrases ‘to,’ ‘capable of/to,’ and/or ‘operable to,’ in one embodiment, refers to some apparatus, logic, hardware, and/or element designed in such a way to enable use of the apparatus, logic, hardware, and/or element in a specified manner. Note as above that use of ‘to,’ ‘capable of/to,’ and/or ‘operable to,’ in one embodiment, refers to the latent state of an apparatus, logic, hardware, and/or element, where the apparatus, logic, hardware, and/or element is not operating but is designed in such a manner to enable use of an apparatus in a specified manner.

A value, as used herein, includes any known representation of a number, a state, a logical state, or a binary logical state. Often, the use of logic levels, logic values, or logical values is also referred to as 1's and 0's, which simply represents binary logic states. For example, a 1 refers to a high logic level and 0 refers to a low logic level. In one embodiment, a storage cell, such as a transistor or flash cell, may be capable of holding a single logical value or multiple logical values. However, other representations of values in computer systems have been used. For example, the decimal number ten may also be represented as a binary value of 1010 and a hexadecimal letter A. Therefore, a value includes any representation of information capable of being held in a computer system.
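The equivalence of these representations can be checked directly; the short sketch below uses only built-in Python conversions, and the variable name `ten` is illustrative.

```python
ten = 10

# The same value written in binary and hexadecimal notation.
binary_text = bin(ten)   # '0b1010'
hex_text = hex(ten)      # '0xa'

# All three literals denote one and the same value.
assert 0b1010 == 0xA == ten
print(binary_text, hex_text)
```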

Moreover, states may be represented by values or portions of values. As an example, a first value, such as a logical one, may represent a default or initial state, while a second value, such as a logical zero, may represent a non-default state. In addition, the terms reset and set, in one embodiment, refer to a default and an updated value or state, respectively. For example, a default value potentially includes a high logical value, i.e. reset, while an updated value potentially includes a low logical value, i.e. set. Note that any combination of values may be utilized to represent any number of states.

The embodiments of methods, hardware, software, firmware or code set forth above may be implemented via instructions or code stored on a machine-accessible, machine readable, computer accessible, or computer readable medium which are executable by a processing element. A non-transitory machine-accessible/readable medium includes any mechanism that provides (i.e., stores and/or transmits) information in a form readable by a machine, such as a computer or electronic system. For example, a non-transitory machine-accessible medium includes random-access memory (RAM), such as static RAM (SRAM) or dynamic RAM (DRAM); ROM; magnetic or optical storage medium; flash memory devices; electrical storage devices; optical storage devices; acoustical storage devices; other forms of storage devices for holding information received from transitory (propagated) signals (e.g., carrier waves, infrared signals, digital signals); etc., which are to be distinguished from the non-transitory mediums that may receive information therefrom.

Instructions used to program logic to perform embodiments of the disclosure may be stored within a memory in the system, such as DRAM, cache, flash memory, or other storage. Furthermore, the instructions can be distributed via a network or by way of other computer readable media. Thus a machine-readable medium may include any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computer), including, but not limited to, floppy diskettes, optical disks, Compact Disc Read-Only Memory (CD-ROMs), magneto-optical disks, Read-Only Memory (ROMs), Random Access Memory (RAM), Erasable Programmable Read-Only Memory (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), magnetic or optical cards, flash memory, or a tangible, machine-readable storage used in the transmission of information over the Internet via electrical, optical, acoustical or other forms of propagated signals (e.g., carrier waves, infrared signals, digital signals, etc.). Accordingly, the computer-readable medium includes any type of tangible machine-readable medium suitable for storing or transmitting electronic instructions or information in a form readable by a machine (e.g., a computer).

Reference throughout this specification to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present disclosure. Thus, the appearances of the phrases “in one embodiment” or “in an embodiment” in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.

In the foregoing specification, a detailed description has been given with reference to specific exemplary embodiments. It will, however, be evident that various modifications and changes may be made thereto without departing from the broader spirit and scope of the disclosure as set forth in the appended claims. The specification and drawings are, accordingly, to be regarded in an illustrative sense rather than a restrictive sense. Furthermore, the foregoing use of embodiment and other exemplary language does not necessarily refer to the same embodiment or the same example, but may refer to different and distinct embodiments, as well as potentially the same embodiment.

1. A processor comprising: a memory subsystem comprising a first memory subunit that includes a status register; an execution engine unit coupled to the memory subsystem, the execution engine unit to: randomly select a load operation to monitor; determine a re-order buffer identifier of the load operation; and transmit the re-order buffer identifier to the memory subsystem; and wherein, responsive to receipt of the re-order buffer identifier, the first memory subunit is to store a piece of information, related to a status of the load operation, in the status register, and responsive to detection of retirement of the load operation, store the piece of information from the status register into a particular field of a record of a memory buffer, wherein the particular field is associated with the first memory subunit.
2. The processor of claim 1, wherein the execution engine unit comprises a re-order buffer that generates the re-order buffer identifier, the re-order buffer comprising: a linear feedback shift register to generate a random number that is to select the load operation; and an instruction latency counter to: start incrementing a counter value responsive to a dispatch of the load operation; and stop the counter value responsive to a write back of the load operation, from the memory subsystem, to the re-order buffer; and wherein, in response to detection of the retirement of the load operation, the first memory subunit is further to store the counter value in an access latency field of the record, which is accessible by software.
3. The processor of claim 2, wherein the execution engine unit further comprises a unified scheduler to: dispatch the load operation for execution in response to the random selection of the load operation; and forward the re-order buffer identifier to the re-order buffer in advance of detection of the write back, to signal to the re-order buffer that the load operation is being monitored.
4. The processor of claim 1, wherein the first memory subunit is a memory ordering buffer and the piece of information is an unknown store address.
5. The processor of claim 1, wherein the first memory subunit is further to write a data access address of the load operation into the record of the memory buffer, wherein the first memory subunit is a memory ordering buffer and the piece of information is whether the load operation is blocked due to an address collision with an earlier store operation.
6. The processor of claim 1, wherein the first memory subunit is a data translation lookaside buffer and the piece of information is one of presence or absence of a miss of the data translation lookaside buffer.
7. The processor of claim 1, wherein the first memory subunit is a data cache unit and the piece of information is a cache latency value of clock cycles for cache access.
8. The processor of claim 1, wherein the first memory subunit is further to write a value for an instruction pointer associated with the load operation into the record of the memory buffer.
9. A system comprising: a memory from which to retrieve data to complete load operations; a core comprising: a memory subsystem coupled to the memory, wherein the memory subsystem comprises a plurality of memory subunits, each containing a status register; and an execution engine unit coupled to the memory subsystem and to the core, the execution engine unit to: randomly select a load operation to monitor from the load operations, the load operation associated with a thread currently executed by the core; determine a re-order buffer identifier of the load operation; and transmit the re-order buffer identifier to the memory subsystem; wherein, responsive to receipt of the re-order buffer identifier, each of the plurality of memory subunits is to store a piece of information, related to a status of the load operation, in the status register corresponding to respective memory subunit; and wherein the core is to: detect retirement of the load operation; and in response to detection of the retirement of the load operation, store the piece of information from each status register into a corresponding field of a record of a memory buffer, wherein the record is accessible by core-executed software as performance monitoring data.
10. The system of claim 9, wherein the execution engine unit comprises a re-order buffer that generates the re-order buffer identifier, the re-order buffer comprising: a linear feedback shift register to generate a random number that is to select the load operation; and an instruction latency counter to: start incrementing a counter value responsive to a dispatch of the load operation; and stop the counter value responsive to a write back of the load operation, from the memory subsystem, to the re-order buffer; and wherein, in response to detection of the retirement of the load operation, the core is further to store the counter value in an access latency field of the record, which is accessible by the core-executed software.
11. The system of claim 10, wherein the execution engine unit further comprises a unified scheduler to: dispatch the load operation in response to the random selection of the load operation; and forward the re-order buffer identifier to the re-order buffer in advance of detection of the write back, to signal to the re-order buffer that the write back is for the load operation that is being monitored by the instruction latency counter.
12. The system of claim 9, wherein the plurality of memory subunits comprises a memory ordering buffer, and in response to detecting the load operation is blocked by a preceding store forward operation with an overlapping linear address, the memory ordering buffer is to set a bit of the status register of the memory ordering buffer.
13. The system of claim 9, wherein the plurality of memory subunits comprises a memory ordering buffer, and in response to detecting the load operation is blocked by an unknown linear store address, the memory ordering buffer is to set a bit of the status register of the memory ordering buffer.
14. The system of claim 9, wherein the plurality of memory subunits comprises a data translation lookaside buffer for which the piece of information is one of a hit or a miss of the data translation lookaside buffer.
15. The system of claim 9, wherein the plurality of memory subunits comprises a data cache unit for which the piece of information is a cache latency value of clock cycles for cache access.
16. A method comprising: randomly selecting, by an execution engine unit coupled to a memory subsystem, a load operation to monitor; determining, by the execution engine unit, a re-order buffer identifier of the load operation; transmitting, by the execution engine unit, the re-order buffer identifier to the memory subsystem; storing in a status register, by a first memory subunit of the memory subsystem responsive to receipt of the re-order buffer identifier, a piece of information related to a status of the load operation; detecting, by a processor that includes the execution engine unit, retirement of the load operation; and storing, by the processor in response to detecting the retirement of the load operation, the piece of information from the status register into a particular field of a record of a memory buffer, wherein the particular field is associated with the first memory subunit.
17. The method of claim 16, further comprising: starting, by the execution engine unit, to increment a counter value of an instruction latency counter responsive to a dispatch of the load operation; stopping, by the execution engine unit, the counter value responsive to a write back of the load operation, from the memory subsystem, to the re-order buffer; and storing, by the processor in response to the detecting the retirement of the load operation, the counter value in an access latency field of the record, which is accessible by software.
18. The method of claim 17, further comprising: dispatching, by a unified scheduler of the execution engine unit, the load operation in response to the random selection of the load operation; and forwarding, by the unified scheduler, the re-order buffer identifier to the re-order buffer in advance of detection of the write back, to signal to the re-order buffer that the write back is for the load operation that is being monitored by the instruction latency counter.
19. The method of claim 16, further comprising writing, by the processor, a value for an instruction pointer associated with the load operation into the record of the memory buffer.
20. The method of claim 16, wherein the first memory subunit is a memory ordering buffer and the piece of information is whether the load operation is blocked due to an address collision with an earlier store operation, the method further comprising writing, by the processor, a data access address of the load operation into the record of the memory buffer.
21. The method of claim 16, wherein the first memory subunit comprises a memory ordering buffer, and in response to detecting the load operation is blocked by a preceding store forward operation with an overlapping linear address, setting, by the memory ordering buffer, a bit of the status register of the memory ordering buffer.
22. The method of claim 16, wherein the first memory subunit comprises a memory ordering buffer, and in response to detecting the load operation is blocked by an unknown linear store address, setting, by the memory ordering buffer, a bit of the status register of the memory ordering buffer.
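
By way of illustration only, and without limiting the claims, the flow recited in claims 16 and 17 may be modeled in software roughly as follows. This is a minimal sketch and not the claimed hardware. The class names (Lfsr16, MemorySubunit, LoadMonitor), the 16-bit register width and tap positions, the sample_mask parameter, the one-monitored-load-at-a-time policy, and the record field names are all assumptions introduced here for explanation and do not appear in the disclosure.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


class Lfsr16:
    """16-bit Fibonacci LFSR; the width and taps (16, 14, 13, 11) are assumptions."""

    def __init__(self, seed: int = 0xACE1) -> None:
        if (seed & 0xFFFF) == 0:
            raise ValueError("LFSR seed must be non-zero")
        self.state = seed & 0xFFFF

    def next(self) -> int:
        s = self.state
        bit = (s ^ (s >> 2) ^ (s >> 3) ^ (s >> 5)) & 1
        self.state = (s >> 1) | (bit << 15)
        return self.state


@dataclass
class MemorySubunit:
    """Hypothetical model of one memory subunit and its status register."""
    name: str                               # record field owned by this subunit
    monitored_rob_id: Optional[int] = None
    status_register: int = 0

    def arm(self, rob_id: int) -> None:
        # Receipt of the re-order buffer identifier starts monitoring.
        self.monitored_rob_id = rob_id
        self.status_register = 0

    def observe(self, rob_id: int, value: int) -> None:
        # Status is captured only for the monitored load.
        if rob_id == self.monitored_rob_id:
            self.status_register = value


class LoadMonitor:
    """Hypothetical model of selection, latency counting, and record write."""

    def __init__(self, subunits: List[MemorySubunit], sample_mask: int = 0x3) -> None:
        self.lfsr = Lfsr16()
        self.subunits = subunits
        self.sample_mask = sample_mask      # 0 selects every dispatched load
        self.rob_id: Optional[int] = None
        self.latency = 0
        self.counting = False
        self.records: List[Dict[str, int]] = []  # stands in for the memory buffer

    def on_dispatch(self, rob_id: int) -> bool:
        # Assumption: at most one load is monitored at a time.
        if self.rob_id is not None:
            return False
        if (self.lfsr.next() & self.sample_mask) != 0:
            return False
        self.rob_id, self.latency, self.counting = rob_id, 0, True
        for unit in self.subunits:
            unit.arm(rob_id)
        return True

    def tick(self) -> None:
        # One clock cycle; the latency counter runs from dispatch to write back.
        if self.counting:
            self.latency += 1

    def on_writeback(self, rob_id: int) -> None:
        if rob_id == self.rob_id:
            self.counting = False

    def on_retire(self, rob_id: int) -> Optional[Dict[str, int]]:
        # On retirement, copy each status register into its own record field.
        if rob_id != self.rob_id:
            return None
        record = {unit.name: unit.status_register for unit in self.subunits}
        record["access_latency"] = self.latency
        self.records.append(record)
        self.rob_id = None
        for unit in self.subunits:
            unit.monitored_rob_id = None
        return record
```

In this sketch, a caller would invoke on_dispatch for each dispatched load, tick once per modeled clock cycle, on_writeback when the load's data returns, and on_retire at retirement; each entry appended to records then corresponds to one record of the memory buffer, with one field per memory subunit plus the access latency field.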